ABSTRACT


To date, the most common simulators of computer systems are software-based and run on
standard computers. One promising approach to improving simulation performance is to apply
hardware, specifically reconfigurable hardware in the form of field programmable gate arrays
(FPGAs). This manuscript describes various approaches to using FPGAs to accelerate
software-implemented simulation of computer systems and selected simulators that incorporate
those techniques. More precisely, we describe a simulation architecture taxonomy that
incorporates a simulation architecture specifically designed for FPGA-accelerated simulation,
survey the state of the art in FPGA-accelerated simulation, and describe in detail selected
instances of the described techniques.
Preface


The slow speed of accurate computer system simulators is a significant bottleneck in the study 
of computer architectures and the systems built around them. Since computers use considerable 
hardware parallelism to obtain the performance they achieve, it is difficult for simulators to track 
the performance of such systems without utilizing hardware parallelism as well. The significant 
capabilities and the flexibility of FPGAs make them an ideal vehicle for accelerating and address- 
ing the challenges of computer system simulation. 

This book presents the current state of the art in the use of Field Programmable Gate
Arrays (FPGAs) to improve the speed of accurate computer system simulators. The described 
techniques and the related work are the result of active research in this area by the authors and 
others over the last ten years. 

Chapters 3 and 4 present solutions that address the major challenges in building fast and 
accurate FPGA-accelerated simulators without undue implementation effort and cost. Chap- 
ter 3 describes how FPGA acceleration can be applied to different simulator architectures, while 
Chapter 4 describes how virtualization, via simulator multithreading and transplanting simula- 
tion activity back to software, can be used to further extend the performance and capabilities of 
FPGA-accelerated simulators. 

Chapter 2 describes simulator architectures that apply not only to FPGA-accelerated sim- 
ulators, but also to pure software simulators and pure FPGA simulators. Four of the simulator 
architectures (monolithic, functional-first, timing-directed, and timing-first) are well known in 
the literature, while the fifth, speculative functional-first (Section 2.6.5), is an architecture that
we developed specifically to be accelerated by FPGAs. In addition to a survey of related work in 
Chapter 5, the Appendix provides a brief introduction to FPGA technologies. 
CHAPTER 1


Introduction 


1.1 OVERVIEW


All scientific and engineering disciplines are supported by models that enable the experimenta- 
tion, prediction, and testing of hypotheses and designs. Modeling can be performed in a variety of 
ways, ranging from mathematical equations (for example, using differential equations to model 
circuits), to miniatures (for example, using a scale model of a car in a wind tunnel to estimate 
drag), and to software-implemented algorithms (for example, using financial models to predict 
stock prices). Simulation, which is often considered a form of modeling itself, applies a model 
repetitively to predict behavior over simulated time. For example, one can simulate climate change 
over one hundred years, or simulate the lifecycle of a star. 

The technique of simulation is commonly used to guide the design of future computer 
systems, or to better understand the behaviors of existing ones. Computer system simulation 
enables the prediction of many different behaviors without the explicit need to build the system 
itself. The most commonly predicted behavior of a computer system is performance, often in
terms of the number of cycles it takes to execute a sequence of instructions. Two other commonly 
predicted behaviors are energy/power consumption and reliability in the presence of faults. 

Computer system simulators are typically implemented in software due to the need for flex- 
ibility and visibility. As computers have grown faster over time, their ability to simulate natural 
phenomena, such as the weather, has improved as well. Though our understanding may
improve, the inherent complexity of the lifecycle of a star, overall, does not increase over time. The
inherent complexity of computer systems, however, continues to advance rapidly over time—in 
fact, at a rate faster than the growth in computer performance. Thus, the performance of computer 
simulation is ever decreasing relative to the next generation computer being simulated. This phe- 
nomenon is known as the simulation wall. Paradoxically, as computer designers have improved 
virtually all other fields’ simulation capabilities, they have simultaneously reduced the ability to 
simulate their next-generation designs. 


1.2 HOST VS. TARGET TERMINOLOGY


When computers are used to simulate other computers, differentiated terms are necessary to avoid 
confusion. Thus, in this manuscript, the computer system being simulated will be referred to as the 
target system. The term target-correct behavior will refer to how the target system would behave
in an actual implementation. For example, if the target mispredicts a branch, the instructions that 
are fetched by the target are considered wrong-path instructions—but are target-correct. 




A simulator is executed on a host platform. For example, when a simulator is implemented
in software, the computer that runs the software is deemed to be the actual host. It is also possible 
to implement the simulator—or portions of the simulator—directly in custom hardware, in which 
case the custom hardware would be considered the host. 


1.3 WHY ARE FAST, ACCURATE SIMULATORS OF
COMPUTER TARGETS NEEDED? 


Fast and accurate simulators provide a vehicle for the rapid exploration of microprocessor
designs. Today, most low-hanging microprocessor improvements have already been implemented,
forcing architects to consider more complex mechanisms with an ever-decreasing likelihood of
commensurate returns. In addition, many new ideas are based on learning algorithms that require 
long training periods before the mechanisms become effective. Thus, a simulator must not only 
be sufficiently detailed to accurately evaluate the proposed architecture, it must offer sufficient 
performance to run long enough to warm up such proposed mechanisms, in order for the user to 
arrive at the correct conclusions. Since microprocessor designs routinely cost hundreds of millions 
of dollars, the ability to accurately evaluate a proposed microprocessor can save on substantial costs 
that otherwise would be wasted on designs that do not offer improvements. 

The advent of multicore processors and increased emphasis on parallelism has led to an in- 
creasingly diverse set of computer architecture ideas under study. This trend has created an ever 
higher demand for more simulation performance. With binary compatibility no longer a strict re- 
quirement in new architectures, a breakthrough concept needs a correctly matched software base 
to demonstrate its full potential. Fast simulators are critical for the development of software in 
conjunction with the design and development of the underlying hardware. Software developers 
(in applications, compilers, and operating systems) are unwilling to devote serious development 
effort until a fast execution platform is available to develop on. It would be desirable if a simulator 
was sufficiently fast for software developers. Unfortunately, software simulators, especially those 
that are performance-accurate, are generally too slow for interactive use, delaying serious
software development until after the target is available, which in turn serializes hardware and
software development.

Outside of computer architecture research, fast and accurate simulators can facilitate the 
development, debugging, and performance tuning of both multi-threaded and single-threaded 
applications. Understanding cache interference can be challenging even for single-threaded ap- 
plications, but is especially hard for multi-threaded applications. Even when real hardware exists, 
developing multi-threaded applications can be challenging due to inherent non-determinism and 
the lack of observability in the underlying execution platform. For example, current Intel proces- 
sors only allow the monitoring of up to four performance counters at any given time. Additional 
counters can only be obtained through multiple runs—each run with a different set of specified 
counters. Thus, cycle-by-cycle information is virtually impossible to obtain from such processors. 

If software-based simulators were sufficiently fast, there would be no need to consider
alternatives to software-based simulation hosts. Unfortunately, today's cycle-accurate simulators
of uniprocessor targets achieve speeds on the order of 1 to 300 kilo-instructions per second
(KIPS), depending on the implementation. Such simulators often compromise accuracy
for performance. Furthermore, there is no widespread use of a parallelized simulator of unicore 
targets despite numerous attempts. The problem is further compounded by the proliferation of 
multicore targets that increases simulator computation at least linearly with the number of target 
cores. There have been attempts to parallelize simulators of parallel targets, such as Graphite [28], 
SlackSim [7], and ZSim [39], but all compromise accuracy in unpredictable ways to achieve
scalability.


1.4 HARNESSING FPGAS FOR SIMULATION NOT 
PROTOTYPING 


Most, if not all, computer systems today leverage hardware-level parallelism extensively to achieve 
high performance. Accurately simulating the high levels of parallelism seen in current and future 
computer systems on commodity, tightly-coupled multicore processors (with limited threads) is 
inherently slow. A fast and cost-effective alternative to software is to apply a programmable accel- 
erator that directly matches the hardware-level parallelism required in accurate computer system 
simulation. 

Field Programmable Gate Arrays (FPGAs) are programmable devices made up of hundreds 
of thousands (if not millions) of small interconnected lookup tables that can be used to realize 
arbitrary logic functions. Unlike a hardwired application-specific integrated circuit (ASIC), where 
all of the logic is cast permanently in silicon, FPGA-based hardware can be easily iterated upon 
in an incremental design-debug cycle similar to software development (see Appendix A for more 
details on FPGAs). 

Because of their programmability, FPGAs enable hardware to be implemented and main- 
tained by a smaller group of people and at lower cost and time than would be required to produce a 
dedicated integrated circuit. The price for using FPGAs is reduced logic capacity and logic speed 
relative to native integrated circuit implementations, roughly by a factor of 10 [23] for each. Even 
so, FPGAs have been successfully deployed in roles ranging from “glue” logic that attaches ASICs
together to accelerators for a wide range of applications. For the right application, a good
FPGA-based implementation
can be multiple orders-of-magnitude faster than software. 

A typical but naive way to harness an FPGA for computer system simulation is to implement
a structurally accurate model of a given target system in FPGAs. For example, to model
a 16-core multiprocessor, one could instantiate 16 separate processor cores that are identical¹ to
the target cores, and connect them together with a network-on-chip (NoC) identical to the target
NoC. Such implementations are referred to as prototypes, rather than simulators, because they
reflect a one-to-one mapping of the target micro-architecture onto an FPGA. While prototyping
small systems made up of simple cores is feasible in today’s FPGA technologies, using FPGAs
to directly prototype larger and/or more complex cores and systems becomes a herculean effort,
with implementation and integration effort not that different (modulo the physical design)
from building the target itself.

¹ At least at the register-transfer level (RTL), though transformations may need to be made to
accommodate the fact that FPGAs have different underlying structures than are possible in ASICs.

It is our belief that, given today’s target architectures and FPGAs, one should view FPGAs 
as a vehicle for simulation, not prototyping. The goal of simulation (as opposed to prototyping) 
is to mimic the target system behavior at the desired level of completeness, accuracy, detail, and 
speed. How the simulation is actually carried out under-the-hood is of little concern to the user. 
When building a simulator using FPGAs, what is actually implemented on the FPGAs need not 
resemble the target system in any way, structurally or physically. 

In fact, there are many good reasons not to mimic the target system. For example, an 
FPGA-accelerated simulator may take advantage of simplifications that make it easier for con- 
struction, such as a constant memory latency, or implementing cache tags only (and not the data 
storage) of the cache. In the first case, accuracy is compromised to make the simulator imple- 
mentation simpler. In the second case, accuracy is not necessarily compromised if the simulator 
produces sufficiently accurate results even though not every component is perfectly modeled. In 
general, “shortcuts” used in software-based simulators often can also benefit FPGA-based simu- 
lation. 
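
As a rough illustration of the tags-only shortcut mentioned above, the following sketch models a
direct-mapped cache that tracks tags and reports hits and misses without storing any data. The
class name, cache geometry, and address trace are hypothetical and chosen only for illustration;
they are not taken from any simulator described in this book.

```python
class TagOnlyCache:
    """Direct-mapped cache model that stores tags but no data (illustrative)."""

    def __init__(self, num_sets=4, line_bytes=64):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags = [None] * num_sets  # one tag per set; None means invalid
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit, False on a miss; install the tag on a miss."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False


cache = TagOnlyCache()
# Addresses 0 and 8 share a line; 256 maps to the same set as 0 and evicts it.
trace = [0, 8, 256, 0]
results = [cache.access(a) for a in trace]
print(results, cache.hits, cache.misses)
```

Because hit/miss behavior depends only on the tags, a model like this can drive timing decisions
while the actual data values are supplied elsewhere, for example by a functional model.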

The natural next step is to recognize that the simulation host is not necessarily bound to
a complete software or FPGA implementation. A practitioner could mix hardware and software 
hosts to generate a faster-than-software-only “good enough” simulator with as little additional 
development time and cost as possible over building a software-only simulator. Thus, the term 
“FPGA-accelerated simulators” is used to indicate hybrid simulators where FPGAs are used to 
accelerate specific components of the simulator, not necessarily the entire simulator. 

Properly constructed FPGA-accelerated simulators are faster than software-only simula- 
tors and can, in fact, be faster than a prototype of the target on an FPGA. Because such simula- 
tors are FPGA-accelerated, rather than FPGA-implemented, software can be used to implement 
components of the simulator that are otherwise inconvenient and/or unnecessary to implement 
on the FPGA. As a result, FPGA-accelerated simulators are not only easier to implement but
also provide more functionality, including full system support, than would otherwise be possible.


1.5 THE REST OF THE BOOK


Chapter 2 gives background on computer simulation. Chapter 3 presents key concepts
of FPGA-accelerated performance simulation. Chapter 4 presents hierarchical simulation
and virtualization techniques. Chapter 5 summarizes the current landscape of FPGA-accelerated 
simulators. Chapter 6 offers final concluding remarks. 

To provide the reader with more context, this book describes two case studies: (1) the FAST 
approach [9-11] for studying the performance of uniprocessor and multiprocessor targets using 
FPGAs, and (2) the ProtoFlex approach [15] for accelerating functional-only full-system
simulation, which can be used standalone or as a component in the construction of performance
simulators. Both FAST and ProtoFlex heavily transform and virtualize the model of the target
system to simplify the efforts needed to achieve simulation completeness and, in the case of
FAST, also timing accuracy on an FPGA-based simulation substrate. Both approaches make
appropriate use of software and FPGA hosts together to achieve the best of both worlds. 


CHAPTER 2 


Simulator Background 


2.1 USES OF COMPUTER SIMULATION 


In computer systems research and design, simulation studies are used when the target system does 
not exist or when a given design is being studied. A simulator implements target behavior in a 
manner that is simpler than building the target—otherwise, one would just construct the target. 
Even when the target system is available, a simulator offers increased controllability, flexibility, 
and observability. Simulators, however, have notable disadvantages compared to the target, such 
as being slower, not accurately modeling the target, or not predicting all the behaviors of the 
target. 

There are two major forms of computer system simulators: functional simulators and per- 
formance simulators. Functional simulation predicts the behavior of the target with little or no 
concern for timing accuracy. Functional behaviors include the execution of CPU instructions and 
the activities of peripheral devices such as a network card sending packets or a DMA transfer from 
disk. The modeling of micro-architectural state that affects performance, such as cache tags, is not 
considered a part of functional simulation. Functional simulation is used for a variety of purposes,
including (1) prototyping software development before a machine is built, (2) providing
preliminary performance modeling and tuning, (3) collecting traces for performance modeling, and
(4) generating a reference execution to check that a performance simulator executes correctly.

A performance simulator predicts the performance and timing behaviors of the target.
Performance simulation is used for a variety of purposes, ranging from evaluating micro-architectural
proposals, to studying performance bottlenecks in existing systems, to comparing machines to
decide which to procure, to enabling the tuning of compilers. Since there are many
timing-dependent behaviors that a practitioner may want to predict, such as resiliency or power
consumption, "performance" simulation refers to the simulation of any or all of the non-functional
aspects of the target. The output of a performance simulator can assume a variety of forms—the 
most common example is the aggregate Instructions Per Cycle (IPC) of the target running a 
specific workload. There are, however, other possibilities such as a cycle-by-cycle accounting of 
the number of instructions fetched or issued, or even a cycle-by-cycle accounting of processor 
resources consumed by every in-flight instruction (e.g., re-order buffer). 

Performance can be simulated in a variety of ways. For example, for a simple microcoded 
machine with a fixed latency for every instruction, the performance can be accurately predicted 
using a closed-form mathematical function that accepts the count of each instruction type as an
argument. For nearly all other machines, however, a cycle-accurate simulator must model the target's
micro-architecture in great detail. The standard way to achieve this is to build a model of every
performance-impacting component, connect such component models together as they would be 
in the target system, and simulate those models in tandem. 
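
To make the closed-form case concrete, the sketch below predicts the cycle count of a
fixed-latency machine as the sum, over instruction types, of instruction count times latency.
The instruction names, latencies, and instruction mix are hypothetical values chosen only for
illustration.

```python
def predicted_cycles(instruction_counts, latencies):
    """Total cycles = sum over instruction types of count * fixed latency."""
    return sum(count * latencies[op] for op, count in instruction_counts.items())


# Hypothetical fixed latencies (cycles per instruction) and a hypothetical mix.
latencies = {"add": 1, "load": 3, "branch": 2}
counts = {"add": 500, "load": 200, "branch": 100}

cycles = predicted_cycles(counts, latencies)
ipc = sum(counts.values()) / cycles  # aggregate instructions per cycle
print(cycles, round(ipc, 3))
```

A model of this kind needs no cycle-by-cycle simulation at all, which is why it applies only when
instruction latencies do not depend on micro-architectural state.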

Functional simulation is typically much faster than performance simulation. Fast functional 
simulators run at roughly the same speed as the host machine, but can incur large slowdowns when 
instrumentation is introduced. Accurate performance simulators are generally at least four orders 
of magnitude slower than their targets. Note that this distinction does not apply to
sampling-based simulation methodologies that simulate the computer system accurately for only a
subset of time or instructions and extrapolate results from the samples. Sampling-based simulators still
require an accurate performance simulator. 


2.2 DESIRED SIMULATOR CHARACTERISTICS 


Simulators have five key intertwined characteristics: speed, accuracy, flexibility, completeness, and 
usability. Simulator design and development is a continuous tradeoff between those characteris- 
tics. We briefly describe them below. 

Speed. The speed of functional and performance simulators is often measured in the num- 
ber of target instructions executed per wall clock second. One notable exception is an analytical 
performance model that does not execute instructions, though one could measure its speed in 
terms of the amount of time it takes to evaluate the model. 

Accuracy. A perfectly accurate functional simulator is one that updates the processor ar- 
chitectural state after each instruction is executed as the target would in a real system, assuming 
atomic and in-order instruction semantics. Generally, if a functional simulator is inaccurate, there 
is a set of programs that it cannot execute correctly or execute at all. 
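
A minimal sketch of this style of functional simulation follows: each instruction is executed
atomically and in order, and the architectural state (registers and program counter) is updated
before the next instruction begins. The three-instruction ISA, its encoding as tuples, and the
example program are invented here purely for illustration.

```python
def run_functional(program, num_regs=4, max_steps=1000):
    """Execute instructions one at a time, atomically and in order."""
    regs = [0] * num_regs
    pc = 0
    executed = 0
    while pc < len(program) and executed < max_steps:
        op, a, b, c = program[pc]
        if op == "addi":      # regs[a] = regs[b] + immediate c
            regs[a] = regs[b] + c
            pc += 1
        elif op == "add":     # regs[a] = regs[b] + regs[c]
            regs[a] = regs[b] + regs[c]
            pc += 1
        elif op == "bnez":    # jump to instruction index b if regs[a] != 0
            pc = b if regs[a] != 0 else pc + 1
        else:
            raise ValueError(f"unknown opcode: {op}")
        executed += 1
    return regs, executed


# Sum 3 + 2 + 1 into r1 using a countdown loop in r0.
program = [
    ("addi", 0, 0, 3),    # r0 = 3
    ("add", 1, 1, 0),     # r1 += r0
    ("addi", 0, 0, -1),   # r0 -= 1
    ("bnez", 0, 1, 0),    # loop back to instruction 1 while r0 != 0
]
regs, executed = run_functional(program)
print(regs, executed)
```

Note that nothing in this loop models time; the same program would produce the same final state
regardless of the target's micro-architecture.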

Performance simulators vary greatly in accuracy. Some performance simulators are mathe- 
matical models that are often less accurate than simulators that account for more temporal detail. 
Other performance simulators are cycle-accurate, implying that the simulator output is accurate 
to a single cycle. If the output of the performance simulator is a single IPC for the entire run, 
that IPC is equal to the total number of instructions executed divided by the total number of cy- 
cles simulated. Likewise, if the output is a cycle-by-cycle accounting of each micro-architectural 
resource used in that cycle, a cycle-accurate simulator can output the exact resources used by the 
target at each cycle. 

It is very difficult to achieve true cycle accuracy. Many simulators claim to be cycle accurate 
but are, instead, cycle-level simulators that model performance at the resolution of a target cycle, 
but are generally not perfectly accurate to the target. 

Flexibility. The flexibility of a simulator is the ease with which the simulator can be changed
to evaluate different target functionalities. Maximizing flexibility is desirable to enable rapid and 
productive exploration of different targets. 

One of the most widely used computer system simulators today is gem5 [19], which began
as a combination of the University of Michigan M5 simulator and the University of Wisconsin
GEMS simulator. Gem5 embodies the work of many developers over multiple years and exposes
many parameters, making it flexible across a wide range of configurations, even allowing for
quick changes to the ISA.

Completeness. The completeness of a simulator refers to how much of the entire system
can be modeled. Ideally, the entire system can be simulated as needed. For example, to simulate
a smartphone, one may want to mimic not only the hardware of the phone and the software
running on top of it, but also the user touching the screen and the cellular network that the phone is
communicating with. Of course, achieving completeness may be difficult but is clearly desirable, 
all other parameters being equal. 

Usability. The usability of a simulator characterizes the amount of effort required to use 
the simulator for a specific purpose. A usable simulator enables, rather than impedes, the desired 
exploration and evaluation. Usability depends on the desired use cases; however, it
incorporates many attributes, including the ease with which the desired applications, runtime systems, and
operating systems can be run on the system, how easy it is to sweep parameters, and so on. The 
usability of a single simulator can vary for different usage cases. For example, a functional-only 
simulator may be very usable to count the number of instructions executed across a range of ap- 
plications, but may be very unusable if accurate performance numbers are required. A simulator 
may not support operating systems, requiring compiling applications with libraries that mimic 
the assumed operating system calls. Such a simulator is less than optimally usable for applications 
that do not require complicated operating system functionality, and is not usable if the applica- 
tions do require complex operating system functionality. A simulator that is extremely accurate 
for a particular micro-architecture may not be usable for a functional-only usage case, due to its 
slow speed, or for a different micro-architecture unless it has support for easily modifying the 
micro-architecture to the desired target micro-architecture. 


2.3 PERFORMANCE SIMULATION ACCURACY


In a performance simulator, accuracy is most often quantified in terms of average percentage 
difference from the performance predicted by a more accurate reference point. Averages ignore 
outliers that often illuminate the successes or failures of the target computer system. 
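
The following sketch illustrates how an average can hide an outlier. It computes the per-workload
percentage difference between simulated and reference IPC, then reports both the mean and the
worst case. The IPC values are hypothetical and are not measurements from any simulator or
machine.

```python
def percent_errors(simulated, reference):
    """Per-workload absolute percentage difference from the reference."""
    return [abs(s - r) / r * 100.0 for s, r in zip(simulated, reference)]


# Hypothetical IPC values for five workloads (not measured data).
reference_ipc = [1.00, 2.00, 0.50, 1.50, 0.80]
simulated_ipc = [1.01, 1.98, 0.50, 1.50, 0.40]

errors = percent_errors(simulated_ipc, reference_ipc)
mean_error = sum(errors) / len(errors)
worst_error = max(errors)
print(f"mean {mean_error:.1f}%, worst {worst_error:.1f}%")
```

In this made-up data set, four workloads are within about 1% of the reference while one is off by
half, yet the mean alone suggests an error of roughly 10%.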

It is impossible to determine how new applications/runtime systems/OSes might behave 
on a simulator that has only been calibrated against existing applications/runtime systems/OSes. 
Moreover, there are many places where inaccuracies could mean the difference between
a profitable and an unsuccessful product. It is very difficult to tell just how inaccurate a simulator is
without an accurate reference point, be it a truly accurate simulator or the real target, to calibrate 
against. 

Any simulator used to make real decisions must offer sufficient accuracy for its uses. It is 
often difficult to tell how accurate a simulator needs to be to generate meaningful results. When 
a company invests billions of dollars to design and build a microprocessor, the stakes are much
higher compared to generating results for an academic paper. However, if academic paper results 
are not sufficiently accurate to guide decisions, those papers may become irrelevant or mislead the
reader. 

In addition, even if the level of accuracy needed is known, it is often impossible to determine 
just how accurate a simulator is. There are two notable exceptions: pure functional simulation and 
true cycle-accurate simulation. If one only needs to be able to run software on a simulator, one can 
test a simulator for its functional correctness. A perfectly accurate simulator is easy to define, but 
virtually impossible to build. Even if the full register-transfer level (RTL) description of a target is available
to be simulated, in real systems non-determinism is difficult to avoid due to issues such as clock 
crossings and human-scale interactions. 

Because there is no way to bound the error for the simulation of an arbitrary target, most 
simulator developers compare their simulation results to either a more detailed simulator that is
presumably more accurate or actual hardware. A functional simulator is the easiest simulator to
verify, since there are often multiple verified-correct implementations of the functionality. For
example, one could use an existing computer system that supports the same ISA to serve as a
functional reference point.

There are at least two forms of simulation inaccuracy. The first form of simulation inaccu- 
racy is abstraction error, where a simplification in the simulation model compared to the target 
introduces error. For example, some versions of SimpleScalar used a constant to model DRAM
latency when, in reality, DRAM latency varies depending on several parameters including the
access pattern (i.e., row buffer locality) and refresh rates. The second form of simulation inaccuracy
is simulator implementation bugs. For example, a practitioner may forget to include the use of 
a parameter in the simulator, making that parameter irrelevant. Either kind of error can be
misleading or, even worse, lead to the wrong conclusions. Both kinds of error are, in general,
non-trivial to find, as
there is generally no good reference to compare against. 
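
The sketch below gives a feel for abstraction error by comparing a constant-latency DRAM model
against a slightly more detailed model that distinguishes row-buffer hits from misses. All
latencies, the row size, and the address traces are hypothetical, and the detailed model assumes a
single bank with one open row and ignores refresh; it is an illustration, not a model of any real
DRAM or of SimpleScalar.

```python
def constant_latency(addresses, latency=100):
    """Abstracted model: every access costs the same number of cycles."""
    return latency * len(addresses)


def row_buffer_latency(addresses, row_bytes=1024, hit=50, miss=150):
    """More detailed model: accesses to the open row are cheaper than misses."""
    total, open_row = 0, None
    for addr in addresses:
        row = addr // row_bytes
        total += hit if row == open_row else miss
        open_row = row
    return total


sequential = [i * 64 for i in range(32)]    # touches only two rows
scattered = [i * 4096 for i in range(32)]   # opens a new row on every access

for name, trace in (("sequential", sequential), ("scattered", scattered)):
    print(name, constant_latency(trace), row_buffer_latency(trace))
```

With these made-up numbers, the constant model overestimates the sequential trace and
underestimates the scattered one, so the sign of the error depends on the workload.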


2.4 SIMULATOR DESIGN TRADEOFF 


Simulator design and development is a continuous tradeoff between the five desired characteris- 
tics. For example, if accuracy does not matter (e.g., always predict an IPC of 1) it is trivial to build 
an infinitely fast simulator that is infinitely flexible and very usable. 

The speed of a simulator can have a first order effect on the accuracy of the overall
simulation. Modern computer systems incorporate many learning structures, such as branch predictors,
cache replacement algorithms, prefetching units, OS page allocation algorithms, and TCP/IP 
windows, that require many cycles to warm up. As different programming languages with differ- 
ent execution paradigms become more prevalent, they have very different effects on the execution 
paths and instruction caching. As novel applications are introduced, they can sometimes stress 
systems in ways they have not been stressed before. Operating systems can consume 75% or more of
the execution time for many modern applications, making their accurate simulation very
important. The simulation of many billions of cycles may be needed to accurately observe overall
system behavior of many interacting systems, each with a significant amount of state. 
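
As a back-of-the-envelope illustration of why speed matters at this scale, the sketch below
converts a simulation speed in KIPS into the wall-clock time needed to simulate a given number of
target instructions. The 10-billion-instruction workload is a hypothetical figure; the speeds
span the 1 to 300 KIPS range quoted in Chapter 1 for software cycle-accurate simulators.

```python
def wall_clock_hours(target_instructions, sim_speed_kips):
    """Hours of host time to simulate the given number of target instructions."""
    seconds = target_instructions / (sim_speed_kips * 1_000)
    return seconds / 3600.0


# Hypothetical workload: 10 billion target instructions.
instructions = 10_000_000_000
for kips in (1, 100, 300):
    print(f"{kips:>4} KIPS: {wall_clock_hours(instructions, kips):,.1f} hours")
```

Under these assumptions, even the fastest of the three speeds needs several hours of host time
for a single run, and the slowest needs months.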

Simulation speed also affects simulator flexibility and usability. If a simulator is already fast
enough, one can take shortcuts that trade simulation speed for flexibility. Likewise, a fast simulator 
enables more experiments to be run, increasing usability. The effect speed has on flexibility and 
usability is not limited to these examples. 

There appears to be no upper limit to the usefulness of speed in a simulator. Simulators 
significantly faster than real-time targets would still be very useful, enabling the study of many 
alternative configurations. The faster the simulator, the larger the design space that can be explored.
Even the performance of real systems can vary, run-to-run, by tens of percent and, therefore, the 
ability to do many runs with slightly different initial states is important. 

However, simplifications are typically made to enable fast turnaround time, and to improve 
flexibility and usability. For example, not running the OS drastically simplifies the simulator by 
avoiding such complications as privileged instructions and support for device I/O. Another exam- 
ple is to run a limited data set or run a benchmark program rather than a real set of applications. 

Depending on the practitioner, there are varied levels of demand on the accuracy and us- 
ability/completeness of the simulated functionality and timing. Moreover, the user may wish to 
uncover different levels of detail to understand the inner workings of the system, and may have 
different requirements of the speed of the simulation relative to real-time. At one extreme, if 
the user requires total knowledge of every wire in real-time, there is little alternative to building 
an appropriately instrumented target, as no simulator will run fast enough. However, if the user 
relaxes their requirements (in even just one tradeoff dimension), many shortcuts become possi- 
ble in the construction of the simulator. Developers of software-based computer simulators have 
long taken advantage of a diverse range of shortcuts in simplifying their efforts while nevertheless 
constructing simulators with "good enough" accuracy. 


2.5 SIMULATOR PARTITIONING FOR PARALLELIZATION 


Simulators can be partitioned in a variety of ways that improve performance through paralleliza- 
tion. In this section, we describe three basic orthogonal partitioning schemes. 


2.5.1 SPATIAL PARTITIONING 


A natural way to partition a simulator is at the spatial boundaries of the target system, enabling 
the simulator to exploit parallelism that has already been exploited in the target, presumably to 
improve target performance. For example, core-level partitioning (Figure 2.1 (i)) partitions the
simulator at the target cores/shared cache level, while module-level partitioning (Figure 2.1 (ii))
further partitions the simulator into structural modules (target core fetch/decode/rename/etc.,
target cache banks).

Spatial partitioning has been studied in previous work [2, 8, 16, 29] in the realm of software
to minimize bandwidth and to tolerate latencies for enabling efficient mapping onto commodity
processors. Though superlinear speedups were achieved in limited cases when compared to a
sequential host running the same number of target cores, due to the additional host cores
providing additional host cache capacity [8], most artifacts achieved 50% efficiency given unicore
host speeds around 100 KIPS. Penry et al. [35] studied automatically parallelizing Liberty
simulators, including those of a unicore target, and also achieved roughly 50% efficiency.

Figure 2.1: Simulator partitioning at different boundaries.

In contrast to software-parallelized results, using core-level partitioning on an FPGA-based
simulator can achieve linear speedup due to the low cost of synchronization on an FPGA [49]. 
Core-level partitioning, however, often cannot extract sufficient parallelism to provide large sim- 
ulation performance improvements, since the simulation is still bounded by the time required to 
simulate a single target cycle on a single target core. By decomposing at the module-level it is pos- 
sible to reduce this bottleneck to the time required to compute a single target module cycle [33]. 

However, simply mapping each core or module to its own dedicated set of FPGA resources 
can consume significant area and incur losses in efficiency. Instead, many FPGA-simulators
employ a form of virtualization, using time-multiplexing to map a set of cores or modules onto a
single FPGA computation resource. This approach, known as multi-threading, is detailed further
in Section 4.3.1.
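
The time-multiplexing idea can be illustrated in software. The following is a minimal sketch, not the mechanism of any particular FPGA simulator: the TargetCore class, the step function, and the round-robin schedule are all hypothetical stand-ins chosen for illustration. One shared host engine advances every target core by one target cycle before any core moves on to the next cycle.

```python
from dataclasses import dataclass


@dataclass
class TargetCore:
    """Hypothetical per-core state: a target cycle counter and a program counter."""
    core_id: int
    cycle: int = 0
    pc: int = 0


def step(core: TargetCore) -> None:
    """Advance one target core by one target cycle (placeholder core model)."""
    core.pc += 4
    core.cycle += 1


def run_multiplexed(cores, target_cycles):
    """Service all cores round-robin on a single shared host engine.

    Returns the number of host steps consumed.
    """
    host_steps = 0
    for _ in range(target_cycles):
        for core in cores:      # every core finishes cycle t before any starts t+1
            step(core)
            host_steps += 1
    return host_steps


cores = [TargetCore(i) for i in range(4)]
host_steps = run_multiplexed(cores, target_cycles=10)
```

In this sketch, four target cores and ten target cycles cost forty host steps: the host resource is shared, so area is saved at the price of one host step per core per target cycle.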


2.5.2 TEMPORAL PARTITIONING


Another approach to extracting parallelism is to split a single performance simulation into a set of 
simulations, each simulating a different chunk of target time, but not necessarily simulating all of 
target time of interest or at the same accuracy. One example of temporal partitioning is statistical
sampling, such as [40] and [50], which is often used in software-based simulators to reduce the
amount of time required for detailed simulation. Using a functional simulator to fast-forward to 
different points during program execution, it is possible to extract temporal parallelism. The key 
disadvantage in such approaches is the inaccuracy created by the fast-forward process and the need 
to run the performance simulator for significant periods of time to warm up micro-architectural 
state after the fast-forward period. Furthermore, when simulating multiprocessor targets,
fast-forwarding often uses a long instruction quantum to interleave the execution of target cores for
efficiency purposes. This coarse-grained interleaving can distort the execution of a multithreaded 
program, resulting in performance inaccuracies. 
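
The structure of a sampled simulation can be sketched as follows. This is an illustrative outline only, not the method of [40] or [50]: the function name, its parameters, and the callback detailed_cycles (a stand-in for a detailed performance model) are assumptions. Instructions between samples are assumed to be fast-forwarded functionally, each sample begins with a warm-up window whose timing is discarded, and only the measurement window contributes to the estimate.

```python
def sampled_cpi(detailed_cycles, n_instructions, period, warmup, window):
    """Estimate cycles per instruction from periodic detailed samples.

    detailed_cycles(i) is a hypothetical stand-in for the detailed model: it
    returns the number of target cycles charged to dynamic instruction i.
    Returns (estimated CPI, number of instructions simulated in detail).
    """
    measured_cycles = 0
    measured_instrs = 0
    detailed_instrs = 0
    start = 0
    while start + warmup + window <= n_instructions:
        # Instructions before `start` are fast-forwarded functionally (not modeled here).
        for i in range(start, start + warmup):
            detailed_cycles(i)                  # warm-up: run in detail, discard timing
        for i in range(start + warmup, start + warmup + window):
            measured_cycles += detailed_cycles(i)
            measured_instrs += 1
        detailed_instrs += warmup + window
        start += period
    if measured_instrs == 0:
        raise ValueError("no complete sample fits in the instruction stream")
    return measured_cycles / measured_instrs, detailed_instrs


# A toy detailed model in which every instruction takes two cycles.
cpi, detailed = sampled_cpi(lambda i: 2, n_instructions=1000,
                            period=100, warmup=10, window=10)
```

Here only 200 of the 1,000 instructions pass through the detailed model. Because each sample is independent of the others once its starting state is known, the samples could in principle be simulated concurrently.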

As FPGA-accelerated simulators can be designed to solve many of the problems asso- 
ciated with sampling simply by the increase in simulation speed, extracting temporal parallelism 
from an FPGA-accelerated simulator is often realized differently than in software. Some FPGA- 
accelerated simulators attempt to solve the problems introduced by sampling (coarse-grained ex- 
ecution and lengthy warm up periods) by mapping these activities onto the FPGA [14]. While 
temporal parallelism is not extracted directly by the FPGA simulator itself, by using functional
execution interleaved on a single-instruction basis and maintaining constant warm-up of key
micro-architectural structures, the FPGA-simulator can enable extracting temporal parallelism in
software.

Another approach to temporal decomposition in FPGA-accelerated simulators is to exploit 
the parallelism between multiple independent experiment trials. As the target cycles in different 
experiments are independent, it is possible to simulate multiple target cycles from different trials
at the same time on an FPGA-simulator [49]. Such an approach is useful when the target system
may only contain a few target cores, allowing multiple small targets to be aggregated into a single
simulation job on the FPGA.


2.5.3 FUNCTIONAL/TIMING PARTITIONING 


It is also possible to partition a simulator into a functional partition/model and a timing parti- 
tion/model. The functional partition simulates the functionality of the target system, while the 
timing partition predicts the performance (and/or power, temperature, reliability, etc.) of the tar- 
get. In a software context, such a partitioning is traditionally used to promote reuse [4, 17] but 
not for parallelism. Functionality changes slowly, as it is exposed as a contract to the entire soft- 
ware stack via the ISA, while the micro-architecture, described in the timing model, changes 
frequently. Thus, change is mostly isolated in the timing model between successive architectural 
refinements while the same functional model can be designed and verified once and then reused 
with infrequent modifications. 

For example, by partitioning along functional/timing at the core-level, the ISA behavior 
and the micro-architectural timing become separate simulation entities (Figure 2.1 (iii)) that can 
then be parallelized. Conceptually, the functional and timing partitions are connected by an ab- 
stract trace buffer that contains a trace consisting of multiple trace buffer entries generated by the 
functional partition, each containing information about its instruction such as the opcode, source 
and destination register names, instruction pointer, data addresses, and so on. 
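
As an illustration, a trace buffer entry and the buffer itself might be represented as below. This is a sketch under stated assumptions: the field names, the bounded FIFO policy, and the class names TraceEntry and TraceBuffer are hypothetical and are not taken from any specific simulator.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TraceEntry:
    """One functionally executed instruction, as seen by the timing partition."""
    ip: int                           # instruction pointer
    opcode: str
    src_regs: Tuple[int, ...]
    dst_regs: Tuple[int, ...]
    data_addr: Optional[int] = None   # only set for loads and stores


class TraceBuffer:
    """Bounded FIFO: the functional partition pushes, the timing partition pops."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()

    def push(self, entry: TraceEntry) -> bool:
        if len(self.entries) >= self.capacity:
            return False              # full: the functional partition must wait
        self.entries.append(entry)
        return True

    def pop(self) -> Optional[TraceEntry]:
        return self.entries.popleft() if self.entries else None


buf = TraceBuffer(capacity=2)
buf.push(TraceEntry(ip=0x100, opcode="add", src_regs=(1, 2), dst_regs=(3,)))
buf.push(TraceEntry(ip=0x104, opcode="load", src_regs=(3,), dst_regs=(4,),
                    data_addr=0x2000))
accepted = buf.push(TraceEntry(ip=0x108, opcode="nop", src_regs=(), dst_regs=()))
first = buf.pop()
```

The bounded capacity limits how far the functional partition may run ahead of the timing partition; the third push above is rejected because the buffer is full.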

A timing partition uses trace information to accurately predict when activity occurs in the 
target. For example, a timing model uses functionally-generated addresses to determine if a load 
hits in the cache. The timing model also uses functionally generated source and destination reg- 
ister names to determine register dependencies. As long as the functional information is exactly 
what the actual target would have generated, a correct timing model can model the target per- 
fectly accurately. In general, however, a functional model requires timing information to generate 
a target-correct trace. For example, the functional partition must know when a branch is mis- 
predicted and when it is resolved to generate the correct wrong path instructions. Likewise, the 
functional partition must know when a load is performed in relationship to a store to the same 
location to return the target correct value to the load. 

In the context of FPGA-based simulators, partitioning along the functional/timing boundary
allows for a variety of mapping choices. By selecting which parts of the simulation are mapped
to the FPGA and how these components interact, different simulator organizations can be
constructed. Chapter 3 details the design space of simulators along this dimension.


2.5.4 HYBRID PARTITIONING


The three forms of partitioning are often employed in concert with each other. For example, spatial 
partitioning can be first applied at the core-level, then functional partitioning can be applied to 
extract the ISA behavior/micro-architecture timing components. The micro-architecture can then
be further spatially partitioned to extract timing module-level parallelism. 


2.6 FUNCTIONAL/TIMING SIMULATION ARCHITECTURES


As discussed previously, simulators are distinct from prototypes and, therefore, have different 
architectures than their targets. While each of the three dimensions of parallelization may be 
explored with different simulator architectures, the functional/timing dimension is a defining 
characteristic of processor simulation. The five basic functional-timing simulator architectures
shown in Figure 2.2 have been categorized in the literature [10, 27] as (i) monolithic simula- 
tors (sometimes called integrated simulators), (ii) timing-directed simulators, (iii) functional-first 
simulators, (iv) timing-first simulators, and (v) speculative functional-first simulators. 


2.6.1 MONOLITHIC SIMULATORS 


A monolithic simulator combines target functionality and target performance prediction in a 
monolithic piece of code. A monolithic simulator is, in fact, one form of an implementation 
of the target and thus might be considered a prototype, but it is likely to have been structured 
differently or simplified in some way from the target. 

Some monolithic simulators compute every event that happens on each cycle for each com- 
ponent. Thus, each component of the functionality of the target is performed at the correct rela- 
tive target time compared to all of the other components of the target. For example, executing an 
ADD instruction on an out-of-order processor requires several steps, many of which are timing 
dependent. The ADD instruction must be fetched from instruction memory, decoded, register- 
renamed, dispatched to a reservation station, wait for operands, issued to the ALU (if necessary), 
completed, written back and sent to the other reservation stations, and retired in order. A mono- 
lithic simulator does not separate functionality from timing; thus, the fetch occurs at the correct 
target time, ensuring that a store to the instruction address in the case of self-modifying code 
occurs at the correct moment in target time. 

A sufficiently detailed register transfer logic (RTL) style description of a microprocessor, 
whether it is written in Verilog, C, or any other language, is a common example of a monolithic 
simulator. In such a simulator, every register in the design is instantiated and correctly written on 
each target cycle. 

Despite the potential for high levels of accuracy, monolithic simulators are difficult to write 
and modify, since they are detailed and require each operation to be carried out at the correct 
target time, even though it might not be necessary to perform the operation at the correct target 
time to be either functionally or timing accurate. 
Figure 2.2: Simulator architectures. Functional model components are in light green, timing model
components in dark green, monolithic components in gray. 


2.6.2 TIMING-DIRECTED SIMULATORS 


Timing-directed simulators were developed to address the design complexity of monolithic sim- 
ulators and to promote code reuse. 

Timing-directed simulators are factored into a timing model and a functional model. The 
functional model performs the actual tasks associated with fetching an instruction, decoding an 
instruction, renaming registers, and actually executing the instruction. When the timing model 
determines that some functionality should be performed, it calls the appropriate functional model 
to perform that function and to return the result (if the timing model depends on that result to 
proceed). Thus, the functionality is performed at the correct target time as it would be in a monolithic
simulator. However, the functional model is implemented separately and can be reused along with 
multiple timing models. 

At a high level, a timing model only needs to model activity that impacts timing and, 
therefore, does not need to model activity that only impacts functionality. For example, given a 
fixed latency integer ALU, one only needs to model its latency in the timing model, rather than 
modeling the functionality of the ALU. One simple way to model latency is to delay the output 
by the desired latency. Another example is the fact that simulated caches do not require actual 
data. A third example is that instructions do not need to be decoded (since they have already 
been decoded by the functional model). Thus, timing-directed timing models (and many timing 
models in general) appear to be aggressively stripped down targets, with only the performance 
skeleton remaining. 
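
The fixed-latency ALU example can be sketched as a simple delay line. This is an illustrative sketch only; the class name FixedLatencyUnit and the token-passing interface are assumptions made for the example. The unit carries opaque tokens (for instance, instruction identifiers) and never computes a result value, since the functional model is assumed to have done so already.

```python
from collections import deque


class FixedLatencyUnit:
    """Timing-only model of a unit with a fixed latency, in target cycles.

    A token passed to tick() in cycle t is returned by the tick() call of
    cycle t + latency. No arithmetic result is ever computed here.
    """

    def __init__(self, latency: int):
        if latency < 1:
            raise ValueError("latency must be at least one cycle")
        self.slots = deque([None] * latency)

    def tick(self, incoming=None):
        """Advance one target cycle; return the token completing now, if any."""
        self.slots.append(incoming)
        return self.slots.popleft()


alu = FixedLatencyUnit(latency=3)
completions = [alu.tick("add#1"), alu.tick(), alu.tick(), alu.tick()]
```

The token issued in the first cycle emerges three cycles later, which is all the timing model needs to know about the ALU.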

To be accurate, the timing model must capture a tremendous amount of tightly connected, 
parallel activity. In fact, this is the reason why fast microprocessors must be implemented in hard- 
ware. If there was a way to efficiently simulate an aggressive microprocessor target on a multi- 
processor host, one could likely use those techniques to make a faster processor. Thus, as the 
complexity of the target computer system grows, the timing model gets progressively slower. The 
timing model is the bottleneck for both a functional-first simulator (described later in this section) 
and a timing-directed simulator. 

The Intel Asim [17] simulator is a timing-directed simulator that utilizes a functional model 
to perform decode, execute, memory operations, kill, and commit. PTLSim [51] is a timing- 
directed simulator that has a functional model that performs instruction operations, but does not 
actually update state, which is left to the timing model. The M5 “execute-in-execute” simulator is 
a timing-directed simulator that functionally performs the entire instruction when the instruction 
is executed in the timing model. 

Depending on (i) the level of accuracy desired, (ii) the target, and (iii) the decision as to how 
functionality is partitioned, the functional model of a timing-directed simulator often reflects the 
target system at least to some degree. For example, an Asim functional model provides infinite 
register renaming to enable simulation of targets with register renaming. Thus, it is possible that 
a target might require a specialized functional model to accommodate it. 
The timing model and the functional model in a timing-directed simulator are very tightly
coupled with bi-directional communication occurring several times per target cycle. As a result, 
exploiting timing-directed partitioning to parallelize simulators is unlikely to result in speedups. 
For example, when simulating a target with an idealized pipeline with one instruction committed 
per cycle (IPC=1), there will be an average of one set of blocking calls between the timing model 
and functional model every target cycle. The blocking calls sequentialize the computation, limiting
parallelism. Thus, if 100 Million Instructions Per Second (MIPS) of simulation performance is 
desired, assuming a minimal one call into the functional model per instruction, an interaction 
occurs every 10ns. The communication latency between any CPU and any off-chip component 
by itself, not counting any time to perform the functional model, will be significantly longer 
than 10ns. Thus, it is not surprising that we are not aware of any software-hosted simulators 
parallelized on timing-directed boundaries. Instead, this partitioning is intended purely for reuse 
and complexity mitigation. 
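
The arithmetic behind this argument is short enough to write out. In the sketch below, the function names are made up for illustration, and the 100 ns round-trip latency is a hypothetical figure chosen only to show the shape of the calculation; it is not a measurement.

```python
def interaction_budget_ns(target_mips: float, calls_per_instruction: float = 1.0) -> float:
    """Host time available per timing/functional interaction, in nanoseconds."""
    interactions_per_second = target_mips * 1e6 * calls_per_instruction
    return 1e9 / interactions_per_second


def max_mips_for_latency(round_trip_ns: float, calls_per_instruction: float = 1.0) -> float:
    """Upper bound on simulation speed if every blocking call costs round_trip_ns."""
    return 1e3 / (round_trip_ns * calls_per_instruction)


budget = interaction_budget_ns(100)      # one call per instruction at 100 MIPS
bound = max_mips_for_latency(100.0)      # assumed (hypothetical) 100 ns round trip
```

With one blocking call per instruction, 100 MIPS leaves 10 ns per interaction. Conversely, under the assumed 100 ns round trip, communication alone would cap the simulator at 10 MIPS before any functional work is counted.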


2.6.3 FUNCTIONAL-FIRST SIMULATORS 


Executing instructions in a processor could potentially interact with other instructions. For exam- 
ple, if the microprocessor has multiple heterogeneous decoders, the decoder selected for a partic- 
ular instruction depends on other instructions being decoded at roughly the same time. However, 
in many such cases, such interactions do not have an effect on functionality and, therefore, can 
be safely ignored without compromising functional accuracy. 

Functional-first simulators were designed around the assumption that timing does not af- 
fect functionality. For simulator implementation convenience, a functional-first simulator sep- 
arates functionality from timing as does a timing-directed simulator. Thus, a functional model 
only needs to be developed once and can be reused across a variety of timing models of different 
targets with varying accuracy. Unlike a timing-directed simulator, the functionality is executed 
before the timing, thus accounting for the name “functional-first.” Doing so further simplifies 
the simulator. The functional model executes the program at an architectural level, executing in- 
structions and modifying architectural state. As it does so, it generates an instruction trace that 
contains information such as the instruction address, the instruction itself, the opcode, source 
registers, destination register(s), data addresses, and so on, that the timing model needs to
predict performance. Thus, a functional model executes ahead of the timing model, feeding the timing
model an instruction trace providing all the necessary information the timing model needs. 
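
The functional-first organization can be sketched in a few lines. Everything in this example is a simplifying assumption made for illustration: the three-opcode instruction format, the dictionary-based trace records, and the tag-only direct-mapped cache with its hit and miss costs. The functional model alone updates architectural state; the timing model sees only the trace and stores no data.

```python
def functional_model(program, regs, memory):
    """Execute architecturally and yield one trace record per instruction."""
    for ip, instr in enumerate(program):
        op = instr[0]
        addr = None
        if op == "load":
            _, dst, addr = instr
            regs[dst] = memory.get(addr, 0)
        elif op == "store":
            _, src, addr = instr
            memory[addr] = regs[src]
        elif op == "add":
            _, dst, a, b = instr
            regs[dst] = regs[a] + regs[b]
        yield {"ip": ip, "op": op, "addr": addr}


def timing_model(trace, n_lines=4, line_bytes=16, hit_cycles=1, miss_cycles=10):
    """Charge cycles using a direct-mapped cache that keeps tags but no data."""
    tags = [None] * n_lines
    cycles = 0
    for entry in trace:
        if entry["addr"] is None:
            cycles += 1                       # non-memory instruction
            continue
        line = entry["addr"] // line_bytes
        index, tag = line % n_lines, line // n_lines
        if tags[index] == tag:
            cycles += hit_cycles
        else:
            tags[index] = tag                 # fill the line on a miss
            cycles += miss_cycles
    return cycles


program = [("load", 1, 0x100), ("load", 2, 0x104), ("add", 3, 1, 2), ("store", 3, 0x200)]
regs = [0] * 8
memory = {0x100: 5, 0x104: 7}
total_cycles = timing_model(functional_model(program, regs, memory))
```

In this sketch the trace is streamed through a generator, but it could equally be collected into a list or written to a file first, since the timing model never feeds information back to the functional model.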

The functional model does not need to run at the same time as the timing model. It is common
to generate and store the instruction trace on disk and then pipe the trace from disk through
the timing model. Many functional-first simulators have been written, including versions of
SimpleScalar [24], SESC [38], and Graphite [28].

With multiprocessor targets, functional execution requires some form of instruction inter- 
leaving, even if it is simply based on the performance characteristics of the host machine. As a 
result, a simulator designed as a functional-only simulator of a multiprocessor target is effectively 
a functional-first simulator attached to some form of implicit timing model. A simple form of
an implicit timing model is an idealized single-cycle pipeline model that allows for deterministic 
interleaving at a single-instruction granularity, which was done in the ProtoFlex simulator. 


2.6.4 TIMING-FIRST SIMULATORS


Timing-first simulators [27] address the difficulty of creating a complete performance simulator 
that accurately models all target functionality. A timing-first simulator includes a performance 
simulator that executes target functionality. Thus, it can be considered a monolithic simulator, 
or even a timing-directed simulator. However, it does not need to implement all functionality, just
the most commonly used functionality. As each instruction is retired, a separate, known-to-be- 
correct, and generally complete functional simulator executes the same instruction within its own 
copy of the architectural state. The architectural updates from the performance simulator are
compared to the architectural updates from the known correct functional simulator. If the updates are
the same, no further action is required and execution continues. If not, the performance simula- 
tor's pipeline is flushed, its architectural state is forced to be the same as the functional simulator, 
and execution continues from that point. 
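
The check-and-resynchronize loop can be sketched as follows. This is a toy illustration and not the implementation in [27]: the two-opcode instruction set, the dictionary-based architectural state, and the function names are assumptions, and the performance simulator's missing "mul" is a deliberate omission used to trigger a correction.

```python
def golden_execute(state, instr):
    """Known-correct functional simulator: supports every opcode."""
    op, dst, a, b = instr
    new = dict(state)
    if op == "add":
        new[dst] = state[a] + state[b]
    elif op == "mul":
        new[dst] = state[a] * state[b]
    return new


def perf_execute(state, instr):
    """Functional side of the performance simulator; 'mul' is unimplemented."""
    op, dst, a, b = instr
    new = dict(state)
    if op == "add":
        new[dst] = state[a] + state[b]
    return new                       # an unsupported opcode leaves dst unchanged


def run_timing_first(program, init):
    perf_state = dict(init)
    golden_state = dict(init)
    flushes = 0
    for instr in program:            # each iteration models one retirement
        perf_state = perf_execute(perf_state, instr)
        golden_state = golden_execute(golden_state, instr)
        if perf_state != golden_state:
            flushes += 1                        # model a pipeline flush
            perf_state = dict(golden_state)     # force the architectural state
    return perf_state, flushes


init = {"r1": 2, "r2": 3, "r3": 0}
program = [("add", "r3", "r1", "r2"), ("mul", "r3", "r3", "r2"), ("add", "r1", "r3", "r1")]
final_state, flushes = run_timing_first(program, init)
```

The unimplemented multiply causes exactly one flush, after which the performance simulator continues from the corrected state and the final architectural state matches the functional simulator.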

Thus, the performance simulator component of a timing-first simulator does not need to 
implement the full instruction set and peripheral functionality, nor does the implementation need 
to be absolutely correct. It relies on the functional simulator to detect and correct omissions and 
errors. This capability reduces the complexity of designing a full-system simulator. 

One downside to timing-first is that every instruction that is being modeled accurately 
must be executed by both the performance model and the functional model. In addition, every 
omission/error detected and corrected introduces simulation error due to the pipeline flush and 
restart. Also, due to the use of a functional simulator for verification, which executes instructions 
in order, a timing-first simulator is limited to targets that only support sequential consistency. A
non-sequentially consistent memory ordering would be flagged as incorrect, even though it might 
be target-correct in a non-sequentially consistent memory model. 


2.6.5 SPECULATIVE FUNCTIONAL-FIRST


While functional-first simulation makes the assumption that functionality is divorced from tim- 
ing, they are actually related. For example, branch mispredicts and resolves, the relative ordering 
of multiprocessor load-stores, and atomic read-modify-write ordering can all directly affect the 
functionality of the processor model. In such cases, a functional-first simulator is inaccurate. This 
is precisely the reason why a timing-directed simulator must break up a monolithic functional 
model into a series of micro-functional steps in order to tightly control the interleaving and ap- 
plication of side-effects for each of these steps. 

Speculative functional-first (SFF) simulation [9, 10] is derived from functional-first
simulation, but addresses the functional-first accuracy problem while providing opportunities for
parallelization and FPGA-acceleration. It is based on two observations. The first is that
micro-functional synchronization enforced by a timing model is only necessary when the instruction
stream and the implied execution produced by the functional model differs from what the target 
would have produced. An observable race, such as when one target core loads from the same loca- 
tion that another target core is storing to, is an example of where the functional model differs from 
the target. The second observation is that scalable target applications have few observable races 
in the functional model, while all applications, scalable or not, will have very frequent observable 
races in the timing model since timing model modules generally communicate every cycle. 

SFF's performance approaches that of functional-first, but preserves the correctness of 
timing-directed simulation. Like a functional-first simulator, an SFF simulator’s functional model 
executes and initially populates the trace without timing model input. The timing model reads
information from the trace as it needs it. However, rather than assuming that the information is 
correct, the timing model (i) detects when the trace information has diverged from being target 
correct and (ii) corrects the diverged trace by providing sufficient information to the functional 
model to enable it to regenerate the trace so it is target correct. 

Divergence between the functional-execution and timed performance simulation is de- 
tected by propagating values generated by the functional model through the timing model into a 
set of timing model oracles. Functional values for potentially inconsistent values (e.g., branch tar- 
gets and memory load/store values) are provided as part of the functional trace. When the timing 
model executes a store, the functional value is stored into the timing model memory oracle. When 
the timing model executes a load, the load is performed from the timing model memory oracle 
and compared against the functional value. If the functional value differs from the target value, 
the functional model is target incorrect and must be corrected before the timing model can
continue. Since the timing model is only responsible for propagating the values, not generating them,
the normal division of work between functionality and timing of a functional-first simulator is 
preserved. 
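
A memory oracle of this kind can be sketched as a small class. The names and interface below are hypothetical and are meant only to illustrate the comparison step: the oracle is updated in the order in which the timing model executes memory operations, and each load compares the value recorded in the functional trace with the value the oracle holds.

```python
class MemoryOracle:
    """Timing-model view of memory, updated in target (timing) order."""

    def __init__(self, initial=None):
        self.mem = dict(initial or {})

    def store(self, addr, functional_value):
        # The timing model only propagates the value; it does not compute it.
        self.mem[addr] = functional_value

    def load(self, addr, functional_value):
        """Return (diverged, target_value) for a load executed by the timing model."""
        target_value = self.mem.get(addr, 0)
        return target_value != functional_value, target_value


oracle = MemoryOracle({0x40: 0})
# Functional order: another core's store of 42 ran before this load, so the
# trace records a load value of 42. In target time the load executes first.
diverged, target_value = oracle.load(0x40, functional_value=42)
oracle.store(0x40, functional_value=42)
diverged_later, _ = oracle.load(0x40, functional_value=42)
```

The first load diverges, and the target value 0 is what would be sent back to the functional model for correction; a load executed after the store agrees with the trace and needs no correction.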

Correcting for functional execution requires communicating back the point of divergence 
and the corrected value that should have been used during the functional execution. The func- 
tional model then rolls back its execution (using a standard checkpoint-replay scheme), corrects 
its execution at the point of divergence, then continues execution. In the case of an incorrect 
functional load, the functional model is rolled back to the load instruction, the value is corrected, 
and the following instructions are replayed with that new value up to, but not including, the first
unexecuted dynamic instruction. 

Multiple rollbacks may occur and their effects, therefore, should accumulate. Multiple
rollbacks can be supported through the use of a replay log that contains an entry for every possible
unretired load. On the execution of a dynamic load instruction, the corresponding entry is
populated with the functional load value. When rollback is initiated due to divergence, the appropriate
entry in the replay log is overwritten with the corrected load value. As instructions are replayed,
the load instructions use the load values from the replay log, rather than re-execute the loads
against the memory. New trace entries are generated that replace the original trace entries
generated during execution or a previous replay. Once the last dynamic instruction to be originally
executed is replayed, the simulator transitions to executing instead of replaying.
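
The replay log mechanism can be sketched as below. This is a simplified illustration under several assumptions that are not from the source: the program is straight-line, so a dynamic instruction index equals its position in the program; a full register snapshot is kept per instruction as the checkpoint; and only two opcodes exist. The class and method names are hypothetical.

```python
class FunctionalModel:
    def __init__(self, program, memory):
        self.program = program
        self.memory = memory
        self.regs = {}
        self.checkpoints = []    # register snapshot taken before each instruction
        self.replay_log = {}     # dynamic index -> value returned by that load
        self.trace = []          # one entry per executed dynamic instruction

    def _run_one(self, idx, replaying):
        op, dst, arg = self.program[idx]
        if op == "load":
            if replaying:
                value = self.replay_log[idx]       # do not read memory again
            else:
                value = self.memory.get(arg, 0)
                self.replay_log[idx] = value
            self.regs[dst] = value
        elif op == "addi":
            self.regs[dst] = self.regs.get(dst, 0) + arg
        return (idx, op, self.regs[dst])

    def execute_next(self):
        idx = len(self.trace)
        self.checkpoints.append(dict(self.regs))
        self.trace.append(self._run_one(idx, replaying=False))

    def rollback(self, idx, corrected_value):
        last = len(self.trace)           # first unexecuted dynamic instruction
        self.regs = dict(self.checkpoints[idx])
        self.replay_log[idx] = corrected_value
        for i in range(idx, last):       # replay, regenerating trace entries
            self.checkpoints[i] = dict(self.regs)
            self.trace[i] = self._run_one(i, replaying=True)


program = [("load", "r1", 0x40), ("addi", "r1", 1), ("load", "r2", 0x44), ("addi", "r2", 10)]
fm = FunctionalModel(program, {0x40: 5, 0x44: 7})
for _ in program:
    fm.execute_next()
fm.rollback(0, corrected_value=9)    # first correction: load 0 should have seen 9
fm.rollback(2, corrected_value=1)    # second correction: load 2 should have seen 1
```

After both rollbacks the register state reflects both corrections, because the second replay starts from a checkpoint that was itself regenerated during the first replay.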

SFF simulators enable accurate simulation whenever the functional model may differ
from the target and, therefore, are much more general than just correcting memory order. For
example, a functional model generally does not generate wrong path instructions. By providing 
a branch predictor oracle (which is simply a model of the target branch predictor) in the timing 
model, the timing model can then compare the instruction pointer generated by the functional 
model with the predicted branch. When a divergence is detected, the functional model is rolled 
back and the branch is taken as the target would have. On branch misprediction correction, the 
branch divergence is detected again and the functional model is corrected again. Nested branches 
are handled by a branch target log, conceptually identical to the load replay log, that is replayed 
when instructions are replayed after a correction, before an unexecuted dynamic instruction is 
reached. 

Speculative memory and arbitrary memory models can be easily simulated in the same 
way. Thus, speculative functional execution with timed performance correction allows an SFF 
simulator to overcome the correctness limitations of a normal functional-first while maintaining 
the basic instruction-at-a-time execution model. 

An example of a speculative functional-first simulator handling branch misprediction is
shown in Figure 2.3. The instruction pointer I2, pointing to the instruction BRz L1, is a
misspeculated branch. At time T = 1, the functional model has already executed five instructions.
The first two instructions are in the timing model and the last three are still waiting in the trace
buffer. In the first stage (fetch), I2 is detected to have been mispredicted by comparing the
instruction pointer produced by the functional model and stored in the trace with the instruction
pointer predicted by the branch predictor in the timing model. The timing model notifies the
functional model that I2 is misspeculated and to continue executing from I4* (either a "*" or a dark
border is used to indicate a wrong-path but target-correct instruction).

The timing model is stalled until the first wrong-path instruction, I4*, arrives. After m cycles
(T = 1 + m), two wrong-path instructions, I4* and I5*, have been sent to the trace buffer,
overwriting the target-incorrect instructions I3, I4, and I5. The timing model fetches I4* on
the next cycle and I5* on the cycle after that, feeding each into the pipeline. The timing model
resolves the branch at T = 3 + m, notifying the functional model. The functional model then
rolls back and regenerates I3, I4, and I5, overwriting (shown by the black trace buffer entries)
the wrong-path instructions by time T = 3 + m + n. The next cycle, the timing model fetches
the next instruction for its Fetch unit and commits I1 by advancing the commit pointer. As the
timing model commits, it notifies the functional model, allowing the functional model to reclaim
rollback resources.
Misspeculation Code Example:


1: R0 = R0 + R2
2: BRz L1
3: R0 = R0 + R3
4: L1: R0 = R0 + R4
5: ...


Figure 2.3: Illustrating the handling of misspeculation in SFF. 


2.7 SIMULATION EVENTS AND SYNCHRONIZATION 


Simulating target time and events is another aspect of simulator organization. Software is not 
naturally parallel but hardware is. Software must execute each concurrent operation sequentially, 
but in the correct order to correctly read and update state. One way to simulate concurrent op- 
erations is using simulator events. An event is a piece of code that models a particular operation 
that might execute concurrently with other events in the target. Each event is stamped with a 
time that indicates the target time when it will execute. As events are created or recycled, they are 
stored in an event wheel or queue. Events can be created for a future time. Events are dequeued 
and executed from the event queue in target time order. 
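
An event queue of this kind can be sketched with a binary heap. The example below is a minimal illustration rather than the design of any particular simulator; the class name, the sequence-number tie-breaker, and the two toy events are assumptions. Each event is stamped with a target time, may schedule further events in the future, and is dequeued in target time order.

```python
import heapq
import itertools


class EventQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker: FIFO among same-time events
        self.now = 0                    # current target time

    def schedule(self, time, action):
        if time < self.now:
            raise ValueError("events cannot be scheduled in the past")
        heapq.heappush(self._heap, (time, next(self._seq), action))

    def run(self):
        while self._heap:
            self.now, _, action = heapq.heappop(self._heap)
            action(self)


log = []


def fetch(q):
    log.append((q.now, "fetch"))
    q.schedule(q.now + 3, execute)      # create an event for a future target time


def execute(q):
    log.append((q.now, "execute"))


q = EventQueue()
q.schedule(5, fetch)
q.schedule(0, fetch)
q.run()
```

Although the event for time 5 was enqueued first, the events run in target time order: fetch at 0, execute at 3, fetch at 5, and execute at 8.
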
It is possible that multiple events are for the same target time. For example, the first event
may be the producer of the data of a pipeline register implemented with a master-slave flip-flop
and the second event may be the consumer of the data of a pipeline register. Thus, both events 
should execute at the same target time. Depending on how the simulator is written, however, the 
order in which those two events are executed in the simulator may or may not matter. 

If, for example, the pipeline register is implemented as two variables, each representing a 
latch in a master-slave flip-flop, either the first event or the second event can execute first, since 
the first event will write into the first variable and the second event will read from the second 
variable. There is no possibility of the second event reading the data written by the first event
on that same cycle. Such an approach, however, requires the first variable be copied to the second 
variable after both events have completed executing, thus requiring another simulator event to 
occur after target events have completed. 

As an alternative, the pipeline register could be implemented as a single variable. In that 
case, the second event must be executed before the first, ensuring that the second event executes 
with the pipeline register's value from the previous cycle. In general, the events must be sorted in 
reverse pipeline order, where the end of the pipeline executes first and the front of the pipeline 
executes last. If, however, the pipeline has a loop, sorting is impossible. In that case, at least one
pipeline register must be simulated by double variables. 
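
The two ways of modeling pipeline registers can be compared directly. The sketch below uses a hypothetical linear pipeline that simply forwards values from stage to stage; the function names and the reverse flag are assumptions made for the example. With one variable per register, evaluation order matters; with two variables per register, it does not.

```python
def run_single_variable(inputs, n_stages=3, reverse=True):
    """One variable per pipeline register; order of evaluation matters."""
    regs = [None] * n_stages
    out = []
    for value in inputs:
        out.append(regs[-1])                       # value leaving the pipeline
        order = range(n_stages - 1, -1, -1) if reverse else range(n_stages)
        for i in order:
            # Stage i latches its input: new data for stage 0, else the
            # current content of the preceding register.
            regs[i] = value if i == 0 else regs[i - 1]
    return out


def run_double_variable(inputs, n_stages=3):
    """Two variables per register (master and slave); order is irrelevant."""
    slave = [None] * n_stages                      # values readable this cycle
    out = []
    for value in inputs:
        master = [None] * n_stages                 # values written this cycle
        out.append(slave[-1])
        master[0] = value                          # front-to-back on purpose
        for i in range(1, n_stages):
            master[i] = slave[i - 1]
        slave = master                             # the extra end-of-cycle copy
    return out


correct = run_single_variable([1, 2, 3, 4, 5])
also_correct = run_double_variable([1, 2, 3, 4, 5])
wrong = run_single_variable([1, 2, 3, 4, 5], reverse=False)
```

In reverse pipeline order the single-variable version delays each value by three cycles, as does the double-variable version in any order. Evaluating the single-variable version front to back lets each value race through every stage within one cycle, so it appears at the output after a single cycle.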

Event queues provide the capability to "skip" target time when there is no other activity. 
For example, if one was simulating a HEP-like eight stage multithreaded microprocessor that 
only executes an instruction from a thread every eight cycles and was only running one thread, 
there is no point to simulate seven stages that are not active at any given target time. One would 
only need to simulate the one active stage out of eight at any given target time. One event-based 
simulator strategy would, as it executes each event, enqueue an event to simulate each successive 
stage into the next target time. 

An alternative to event-based simulation is cycle-by-cycle simulation that executes every 
component/event every cycle. Such a scheme may seem less efficient, since there may be times 
when a particular component has nothing to do, but doing so eliminates the event queue
overheads. There are cases where a cycle-by-cycle simulator is faster than an event-driven simulator,
especially if the events are appropriately statically scheduled to eliminate overheads [21]. 

Maintaining synchronization between timing events requires simulating a consistent target
clock. Maintaining synchronization between functional events, on the other hand, requires
simulating an instruction interleaving that adheres to a desired target cycle interleaving. 

In monolithic or timing-directed simulator designs, it is important to decide how to sim- 
ulate a consistent target clock as all events are timing events. In a functional-first simulator, how 
functional instruction execution is interleaved across multiple target cores can be chosen to bal- 
ance a tradeoff between accuracy and efficiency. For example, simulation efficiency might be very 
high if one target core's instructions are completely executed before another target core's instructions
are completely executed. However, since one target core's instruction execution can affect
another target core's instruction execution, executing one to completion before the other can result
in inaccurate results. Fine-grain interleaving may more closely model the actual target instruction 
interleaving but reduce the overall simulation efficiency. 
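
A toy illustration of this tradeoff follows. It assumes, purely for the sake of the example, a target in which both cores retire one instruction per cycle in lockstep, so that an interleaving quantum of one instruction matches the target exactly; the `interleave` function and the tiny instruction format are hypothetical.

```python
def interleave(programs, quantum):
    """Run per-core instruction lists round-robin, `quantum` at a time.

    Returns the value of the shared flag that each loading core observed.
    """
    memory = {"flag": 0}
    observed = {}
    pcs = [0] * len(programs)
    while any(pc < len(prog) for pc, prog in zip(pcs, programs)):
        for core, prog in enumerate(programs):
            for _ in range(quantum):
                if pcs[core] >= len(prog):
                    break
                op, arg = prog[pcs[core]]
                if op == "store":
                    memory["flag"] = arg
                elif op == "load":
                    observed[core] = memory["flag"]
                pcs[core] += 1
    return observed


core0 = [("nop", None), ("nop", None), ("store", 1)]
core1 = [("load", None), ("nop", None), ("nop", None)]

fine = interleave([core0, core1], quantum=1)    # core 1 loads before the store
coarse = interleave([core0, core1], quantum=3)  # core 0 runs to completion first
```

Under the stated assumption, the fine-grain run reports that core 1 read 0, as the target would, while the coarse-grain run reports 1. The coarse run switches between cores less often, which is where the efficiency gain comes from.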


2.7.1 CENTRALIZED SYNCHRONIZATION 


Centralized synchronization uses a centralized component to synchronize across multiple events. 
On a sequential host, centralized synchronization between events is naturally implemented by a 
correctly implemented event queue and/or cycle-by-cycle simulation. 

Centralized synchronization has also been used in parallelized simulators [8]. An example 
is a centralized component running on a single core that other components can call to perform 
the synchronization activity. Doing so, however, limits the amount of parallelism to the amount 
of throughput that the centralized synchronizer supports. 

The implication of a centralized barrier synchronization is that everything from one target 
cycle (or sub-cycle) must be complete before the next target cycle (or sub-cycle) starts. Doing so 
simplifies the simulator, at the potential cost of performance, especially in a parallelized simulator. 


2.7.2 DECENTRALIZED EVENT SYNCHRONIZATION 


One can also implement synchronization in a decentralized way, where synchronization is imple- 
mented on a per-connection basis, rather than in a centralized component. What that implies is 
that different parts of the simulator can be active at the same time, providing more opportunities 
to parallelize the simulator. 

For example, A-Ports [34], used in Asim, or FAST-Connectors [11] can provide decentralized
synchronization. Each simulator module may have multiple inputs, each represented by
the endpoint of an A-Port that is connected to that module. The module waits for inputs on all
of its A-Ports before it executes. A-Ports require activity every target cycle, even if it is a null
input, which keeps each input on the module synchronized with the others. The latency of such
ports/connectors can be set, along with the bandwidth, enabling the same code to model a wider
interface (say, a one-, two-, or four-issue processor).

CHAPTER 3


Accelerating Computer System 
Simulators with FPGAs 


FPGA-based simulators have been proposed to speed up simulation by parallelizing the various 
activities of a simulator and mapping these activities to execution resources on the FPGA fabric. 
Thus, the first task in developing an FPGA-accelerated simulator is to determine how the simulator
should be partitioned for parallelization, and which components should run on the FPGA and which
components should run in software.

Because FPGAs are hardware, they are quite good at exploiting simulator parallelism, especially
when simulating hardware targets. One can execute events in parallel in hardware. Of
course, an FPGA often contains fewer resources than the target, resulting in one of two possibilities:
either not all of the parallelism of the target is exploited in the simulator, or multiple FPGAs
are used to provide additional resources so that all of the parallelism of the target can be directly
exploited in the simulator. Using multiple FPGAs, in turn, requires partitioning the simulator across them.

This chapter describes how FPGAs can be applied to accelerating simulators of computer 
system targets, focusing on how the partitioning of standard simulator architectures can be leveraged
to partition simulators for FPGA acceleration. It concludes with a case study of
an FPGA-Accelerated Simulation Technologies (FAST) simulator, a speculative functional-first
FPGA-based simulator.


3.1 EXPLOITING TARGET PARTITIONING ON FPGAS 


A computer system target is naturally partitioned. A simulator can be partitioned on those target
boundaries. For example, one could parallelize the simulation of a two-core processor by mapping
one target core onto a host core and the other target core onto another host core. Using core-level
partitioning on an FPGA-based simulator can achieve linear speedup due to the low cost of
synchronization on an FPGA [49]. Core-level partitioning often fails to extract sufficient parallelism
as the detail of the core increases, since the simulation is still bounded by the time required to
simulate a single target cycle on a single target core. By decomposing at the module level, it is possible
to reduce this bottleneck to the time required to compute a single target module cycle [33].

However, simply mapping each core or module to its own dedicated set of FPGA resources
can consume significant FPGA area and, therefore, reduce computation efficiency. Instead, many
FPGA-based simulators employ a form of virtualization, using time-multiplexing to map a set of cores
or modules onto a single FPGA computation resource. This approach, known as multithreading,
is detailed further in Section 4.3.1.


3.2 ACCELERATING TRADITIONAL SIMULATOR 
ARCHITECTURES WITH FPGAS 


Alternatively, one could use the partitioning found in traditional simulator architectures when
accelerating with FPGAs, leveraging the extensive experience and simulator designs from the literature.


3.2.1 ACCELERATING MONOLITHIC SIMULATORS WITH FPGAS 


A monolithic simulator is either not partitioned, or partitioned on module boundaries (see the
previous section). Because monolithic simulators are not partitioned into functional/timing components,
it is difficult to accelerate only parts of the simulator in an FPGA and leave other parts in software. Thus,
monolithic simulators are oftentimes close to prototypes. In such cases, an FPGA implementation
would also be close to a prototype and, therefore, likely to be difficult to implement in the
case of a realistic target. In a realistic target, the number of resources would be large, the
interconnection rich, and many multi-ported memories would be present, all of which consume a
significant number of FPGA resources.

It is possible that performance prediction is somewhat abstracted. For example, one could 
make DRAM latency a constant. However, because full target functionality is implemented in 
a monolithic simulator and, therefore, difficult to reuse, the implementation costs are generally 
high. In addition, prototypes are often not sufficiently flexible to enable exploration, but tend 
to only model their particular target micro-architecture and slight variations. Versions of Lib- 
erty [35], RAMP-Red [30], RAMP-Blue [22], and RAMP-White [1] are examples of mono- 
lithic simulators implemented on FPGAs. 


3.2.2 ACCELERATING TIMING-DIRECTED SIMULATORS WITH FPGAS 


There are many interactions between the functional model and timing model in an accurate 
timing-directed simulator. Thus, implementing a fast timing-directed simulator in an FPGA re- 
quires both the functional model and the timing model to be implemented on the FPGA to 
minimize communication costs. Implementing the functional partition in the FPGA consumes
resources that scale with the complexity of the modeled instruction set, making simpler ISAs
more attractive to implement on an FPGA. Implementing a full x86-64 ISA on an FPGA, for
example, would consume significantly more FPGA resources than a SPARC V8 ISA. A functional
model becomes, in essence, a monolithic simulator of the ISA that is itself quite complex.

The Intel/MIT HAsim simulator [33] is an FPGA-based timing-directed simulator that is
effectively an FPGA implementation of an Asim simulator. RAMP-Gold [49] is another
timing-directed simulator implemented on an FPGA. It has a functional model that is separate from the
timing model and split into several components, each of which can be individually directed to process
by the timing model. Both the functional model and the timing model are implemented on the
FPGA to minimize the communication overhead. Simulating many target structures requires a
tremendous number of resources, which the FPGA is unlikely to have. Both HAsim and RAMP-Gold
use transplant and multithreading technologies, both discussed in Chapter 4, to provide full-system
capabilities and support larger targets, respectively.


3.2.3 ACCELERATING FUNCTIONAL-FIRST SIMULATORS WITH FPGAS 


The same reasons to implement a timing model on an FPGA apply to a functional-first simulator
as to a timing-directed simulator: there is a tremendous amount of tightly coupled
parallel activity, which is exactly what hardware fundamentally implements. Thus, one could
implement a functional-first timing model on an FPGA and feed it with an instruction trace that is
either dynamically generated by a functional simulator, a virtual machine, or even a microprocessor
modified to generate a trace, or stored on a disk and piped to the timing model. ReSim [18] is an
example where an FPGA was used to implement the timing model of a functional-first simulator,
running at roughly 22 MHz, which is considerably faster than an accurate software-based timing
model.


3.2.4 ACCELERATING TIMING-FIRST SIMULATORS WITH FPGAS 


To the best of our knowledge, no true timing-first simulators have been implemented on or ac- 
celerated with an FPGA. Certainly it could be done, even with an FPGA-based performance 
model and a software-based functional model. However, the FPGA-based performance model 
becomes, in essence, a monolithic simulator. The performance simulator would somehow push its 
architectural state changes to the functional simulator for comparison. Instead of pushing all of 
the changes, a checksum/hash could be generated from the architectural state in order to improve 
checking performance. 
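
The checksum idea can be sketched as follows. This is a hedged illustration, not a description of any existing simulator: it assumes a hypothetical architectural state consisting of a program counter and a list of 64-bit integer registers, and it uses SHA-256 only for convenience. A hardware implementation would more likely use a cheaper function such as a CRC.

```python
import hashlib
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def state_digest(pc, regs):
    """Hash an assumed architectural state (PC plus integer registers)."""
    h = hashlib.sha256()
    for value in (pc, *regs):
        # Pack each value as a little-endian unsigned 64-bit integer.
        h.update(struct.pack("<Q", value & MASK64))
    return h.digest()


def states_match(performance_state, functional_state):
    """Compare two (pc, regs) states by digest instead of field by field."""
    return state_digest(*performance_state) == state_digest(*functional_state)


perf = (0x1000, [1, 2, 3])
func_ok = (0x1000, [1, 2, 3])
func_bad = (0x1000, [1, 2, 4])   # one register diverged

agree = states_match(perf, func_ok)
diverged = not states_match(perf, func_bad)
```

Only the fixed-size digest would need to cross the link between the performance model and the functional model, rather than every architectural state change.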

A timing-first simulator would be limited by the speed of the slowest component. In addition,
such a strategy would still have the limitations of timing-first simulation, including the inability 
to model arbitrary memory models and the inaccuracies introduced by any error or omission 
from the performance simulator. For these reasons, a timing-first simulator does not appear to be 
amenable to FPGA acceleration. 


3.2.5 ACCELERATING SPECULATIVE FUNCTIONAL-FIRST WITH FPGAS 


Speculative functional-first (SFF) can be used in a software simulator to promote reuse by making 
a potentially very complex functional simulator able to accurately predict performance. Doing so, 
however, would not improve performance. 

SFF was originally conceived and designed to be accelerated on an FPGA, specifically 
running the functional model in software and the timing model on an FPGA. Since round-trip 
interaction between the two models occurs only when the timing model detects a divergence that 
needs to be corrected, the expectation is that such round-trips occur infrequently and, therefore,
the functional and timing models can be implemented on hosts that are relatively far away
from each other. Thus, the communication latency between an FPGA-based timing model and 
a software-based functional model can be tolerated. 

Since the functional model can run in software, it can be derived from an existing simulator. 
One could even start with a full system simulator that can boot operating systems and run standard 
applications. Thus, it offers an alternative to the transplant method described in the next chapter. 


3.2.6 ACCELERATING COMBINED SIMULATOR ARCHITECTURES WITH 
FPGAS 


As with timing-first simulation (which effectively combines a monolithic simulator with a
functional-only simulator), it is possible to combine these approaches. For example, the ProtoFlex
simulator [14] is a SMARTS simulator that uses fast functional simulation to warm the caches
and branch predictor and periodically runs a detailed performance model using those warmed-up
caches and branch predictor. The fast functional model is implemented on an FPGA while
the performance model is implemented in software. This approach preserves the benefits of
software simulation while enabling more accurate fast-forwarding and sampling due to the ability
of the FPGA to enforce fine-grain instruction interleaving and constantly warm up key
micro-architectural resources.


3.3 MANAGING TIME THROUGH SIMULATION EVENT
SYNCHRONIZATION IN AN FPGA-ACCELERATED
SIMULATOR


Simulator synchronization can ensure that different target components are at the same target 
time. There are two basic strategies when designing a synchronization network across an FPGA: 
centralized and decentralized. Centralized schemes can offer simple, easy-to-verify designs with
relatively high performance when the number of events is limited. When the number of events grows larger,
decentralized schemes that exploit the spatial organization of the target itself allow better timing 
and resource utilization. 


3.3.1 CENTRALIZED BARRIER SYNCHRONIZATION IN AN 
FPGA-ACCELERATED SIMULATOR 


A centralized synchronization scheme tuned for an FPGA can leverage the organization of the
FPGA along with deterministic execution delays to simplify the design. For example, if we design
a barrier-synchronization scheme to enforce single target-cycle synchronization between multiple
concurrent target-cores, we can select a strict round-robin scheduler when scheduling a target-core
onto a given FPGA resource with deterministic execution delay. As FPGA computation
resources often have fixed latencies, implementing synchronization through round-robin scheduling
is relatively easy, but it limits overall throughput to what the latency of the slowest
component allows.

On the other hand, if we must tolerate variable latencies across multiple simulator com- 
ponents, a more generic centralized scheduler can be created that tracks the completion of these 
target-cores. For example, within a single host pipeline of the RAMP-Gold simulator [49],
instructions from multiple simulated cores are synchronized on every committed instruction. As
multiple simulated cores share the same host pipeline, enforcing this functional synchronization 
is lightweight. The primary downside of these centralized approaches is the performance overhead 
incurred when the number of components/events grows large. 


3.3.2 DECENTRALIZED BARRIER SYNCHRONIZATION IN AN 
FPGA-ACCELERATED SIMULATOR 


Alternatively, it is possible to use a decentralized scheme, where events are synchronized only
in relation to the subset of events required to compute a new event. These distributed schemes
use timed ports similar to the software ports found in the Asim simulator. Each port represents
a communication link in the target design, carrying a set of timed messages between simulator
components. Ports are uni-directional, point-to-point links whose primary purpose is to ensure
that messages generated at higher target cycles are not consumed too early at lower target cycles.
If, for example, at least one token is pushed through a port every cycle, one can synchronize a
component with two inputs by simply waiting for both inputs to have tokens from the next target
time before proceeding to that time.
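
The token-based rule described above can be sketched in software. This is a simplified model written for illustration, not the HAsim, FAST, or Asim port implementation; the class names `TimedPort` and `Module`, and the choice to pre-fill each port with null tokens covering its latency, are assumptions of this sketch.

```python
from collections import deque


class TimedPort:
    """Uni-directional, point-to-point link with a latency in target cycles."""

    def __init__(self, latency):
        self.latency = latency
        # Null tokens cover the cycles before the first real message can arrive.
        self.queue = deque((cycle, None) for cycle in range(latency))

    def send(self, cycle, msg=None):
        # msg=None is the null token a producer sends when it has nothing to say.
        self.queue.append((cycle + self.latency, msg))

    def ready(self, cycle):
        return bool(self.queue) and self.queue[0][0] == cycle

    def receive(self, cycle):
        arrival, msg = self.queue.popleft()
        assert arrival == cycle
        return msg


class Module:
    """Advances one target cycle only when every input has a token for it."""

    def __init__(self, inputs):
        self.inputs = inputs
        self.cycle = 0
        self.log = []

    def try_step(self):
        if not all(port.ready(self.cycle) for port in self.inputs):
            return False                      # wait; no global barrier involved
        msgs = [port.receive(self.cycle) for port in self.inputs]
        self.log.append((self.cycle, msgs))
        self.cycle += 1
        return True


a, b = TimedPort(latency=1), TimedPort(latency=2)
consumer = Module([a, b])
consumer.try_step()        # cycle 0 proceeds on the initial null tokens
stalled = not consumer.try_step()   # port a has no token for cycle 1 yet
a.send(0, "x")             # produced at target cycle 0, visible at cycle 1
consumer.try_step()        # cycle 1 now proceeds
```

Each module consults only its own input ports, so modules whose inputs are ready can advance while others wait.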

The HAsim and FAST port designs [10, 33] are examples of this form of distributed
synchronization of performance events. These schemes allow greater flexibility and scalability as the
total number of ports in any given module is generally much smaller than the total number of 
ports/events in the entire target model itself. 


3.4 FPGA SIMULATOR PROGRAMMABILITY 


FPGA-accelerated simulators offer many advantages in terms of speed and scale over software, 
but may require end users to implement a significant portion of their simulation models using 
hardware description languages (HDLs) such as Verilog or VHDL. Conventional Verilog or 
VHDL can be tedious and error prone to write, limiting the rate at which end users can introduce 
changes to their functional and timing models. 

Today, there are a growing number of options for bridging the gap between programming
in high-level software and programming hardware. For example, Chisel is an open-source
hardware description language based on the Scala functional programming language [12] that
could also be used to improve the agility of FPGA-based simulation users. Commercial FPGA
tools have also begun supporting high-level synthesis tools such as OpenCL-to-gates (Altera [32])
and C-to-gates (Vivado HLS from Xilinx [48]).

As another example, a significant portion of the ProtoFlex full-system functional simulator
was developed in just under one man-year using BSV [13]. The ProtoFlex functional engine
was a non-trivial implementation effort targeting a 16-way multithreaded UltraSPARC III
processor. In BSV, the use of guarded atomic actions allowed the simulator implementation to be
abstracted at a high level closer to the design specification without limiting the ability to create
high-quality implementations. Other FPGA-accelerated simulation projects such as FAST [10]
and HAsim [33] have also adopted the use of BSV.


3.5 CASE STUDY: FPGA-ACCELERATED SIMULATION
TECHNOLOGIES (FAST)


The FPGA-Accelerated Simulation Technologies (FAST) project first proposed the speculative
functional-first simulator architecture and developed the first SFF simulators. The first FAST
simulator, FAST-UP, simulated a unicore two-issue out-of-order x86-based computer with eight- 
way 32KB L1 instruction and data caches and a 256KB shared L2 cache. It supported 64 ROB 
entries, 16 shared reservation stations, 16 load/store queue entries, a four-way, 8K BTB gshare 
branch predictor, and up to four nested outstanding branches. 

FAST-UP’s timing model was structurally partitioned using parameterized timing model 
components. Common components were provided in libraries and instantiated as needed. Com- 
ponents were connected together with FAST Connectors [11] that each provided configurable 
latency, bandwidth, and throughput and a shared buffer. Timing was controlled in a distributed 
way through the connectors. 

The functional model was a version of QEMU [3] that was heavily modified to add instruction
trace generation, checkpoint, and rollback. It was able to boot both Linux and Windows and
run interactive Microsoft Word and YouTube on Internet Explorer [10, 43], and was designed
to incorporate fast and accurate power models as well [44].

FAST-UP ran on a DRC Computer development platform that contained a dual socket 
motherboard, where one socket contained an AMD Opteron 275 and the other socket contained 
a Virtex4 LX200 FPGA. The CPU and FPGA communicated over HyperTransport. The timing
model was untuned, using over 30 host cycles per target cycle, and was the simulator
bottleneck, resulting in roughly 1 to 3 MIPS, depending on branch prediction accuracy.

The current version of FAST, FAST-MP, has a highly tuned multithreaded timing model
(see Chapter 4) that runs at 100 MHz, supporting up to 100 MIPS. Because the timing model is
multithreaded, it uses centralized synchronization. It runs on a single Virtex 6 240 FPGA connected
to the host CPU via PCIe. The functional model is a parallelized version of QEMU with all of the
hooks necessary to functionally and performance-accurately simulate a multicore x86 system,
including I/O. It has been parallelized with 90%+ efficiency and runs on a six-core Intel processor.
The aggregate functional performance is about 25 MIPS/host core, including all support for tracing,
checkpoint, and rollback. The functional performance can be efficiently shared between all of the
target cores.

CHAPTER 4


Simulation Virtualization 


As discussed in Chapter 1, there is no strict requirement that a structural correspondence exists 
between the target system and what is actually implemented on an FPGA. Given this relaxed 
demand for structural fidelity, a well-engineered FPGA-accelerated simulator should achieve, 
in comparison to a structurally accurate prototype, a much higher simulation rate (measured in 
instruction count or other architecturally visible metrics) and incur lower design effort and logic 
resources. 

This chapter discusses the significant benefits that can arise from harnessing FPGAs not 
as a hardware prototyping substrate but as a virtualizable compute resource for executing and 
accelerating simulations. In particular, it examines two key virtualization techniques
developed and utilized by the ProtoFlex project [14] for accelerating full-system and multiprocessor
simulations.

The first virtualization technique is Hierarchical Simulation with Transplanting for simplifying
the construction of an FPGA-accelerated full-system simulator. In Hierarchical Simulation,
one accelerates in FPGAs only the subset of the most frequently encountered behaviors (e.g.,
ALU and load/store instructions) and relies on a reference software simulator to support
simulation of rare and complex behaviors (e.g., system-level instructions and I/O devices).

The second technique is time-multiplexed virtualization of multiple processor contexts onto 
fewer high-performance multiple-context simulation engines. Simulation virtualization decou- 
ples the required complexity and scale of the physical hardware on FPGAs from the complexity 
and scale of the target multiprocessor system. Unlike a direct prototype, the scale of the accel- 
eration hardware on the FPGA host is an engineering decision that can be set judiciously in 
accordance with the desired level of simulation throughput. Before delving into the details of the 
two virtualization techniques, the next section first explains the background and requirements of 
full-system and multiprocessor simulations. 


4.1 FULL-SYSTEM AND MULTIPROCESSOR SIMULATION


In addition to modeling processors and memory, full-system simulators model a complete system, 
including system-dependent behaviors, I/O, and peripherals. The intent is to model a system 
to a sufficient degree of architectural-level fidelity such that real-world software, e.g., operating 
systems and commercial workloads can run without modification or re-compilation. This category 
of simulation is important when exploring architectural-level features with system-wide effects 
that cannot be studied or demonstrated with simplified, I/O-less, user-level benchmarking, such
as OLTP transaction processing. The importance of full-system simulation is underscored by the
large number of available software-based full-system simulators (e.g., Simics [26], QEMU [3],
and SimNow [41]), many of which can be augmented with performance models.

Developing a full-system simulator is a complex endeavor due to the completeness of the 
model that must be captured. However, the extensiveness of what needs to be incorporated into 
the simulator does not directly impact simulation speed. It should not be surprising that, in a 
software-based full system simulator, most of the simulation time is consumed by emulating in- 
struction execution and memory accesses and not I/O due to the rarity of I/O events incurred by 
the devices relative to the rate of CPU processing events. In a given period of simulated time, the 
number of simulated events stemming from instruction execution and related memory accesses 
dwarfs everything else that occurs in a full-system simulation. Consequently, implementing all 
peripheral and I/O subsystems in an FPGA, which increases implementation effort enormously, 
contributes little to simulation speed. It is important to note that many I/O devices can already
be simulated faster than real time using software (e.g., DiskSim [5]).

Using techniques such as binary re-writing and native execution, a single-threaded, 
software-based full-system simulation of a uniprocessor target system can run close to the speed 
of the real system. However, once such a simulator is instrumented (e.g., to model a functional,
trace-based cache hierarchy), it can incur slowdowns of 10X or more (as presented in [14, 31]).

When simulating a multiprocessor system, the slowdown of a single-threaded software- 
based simulator grows at least linearly with the number of simulated processors. It may seem nat- 
ural to port the multiprocessor target simulation to a multiprocessor host to offset this slowdown, 
but parallelizing multiprocessor software simulators is far from being a solved problem. The fore- 
most challenge is that the scalability of distributed parallel simulation is limited if the simulated 
target components interact at a granularity (frequency and latency) below the communication 
granularity of the underlying host system [25, 35]. If nothing is done, the host communication 
latency introduces artificially large communication latency (in target simulated time), leading to 
unrealistic timing or interleaving of simulated target events. On the other hand, accounting for 
the effect of the host communication latency requires undesirable performance-robbing stalls be- 
tween dependent simulation events. 

Hardware-based acceleration using FPGAs offers an alternative way to speed up multiprocessor
simulation. The higher simulation rates from a single FPGA forestall the need to distribute
the simulation. When scaling to distributed simulations, the hardware-level interactions allow
for better-proportioned simulation speed and communication delay. In the rest of this section,
software-based or software-only simulation will refer specifically to single-threaded simulator
execution.

4.2 HIERARCHICAL SIMULATION WITH
TRANSPLANTING


Hierarchical simulation with transplanting is motivated by the observation that the great major- 
ity of behaviors encountered dynamically make up a small subset of the total system's reachable 
set of behaviors. It is this small subset of behaviors that primarily determines the overall simula- 
tion performance. To improve simulation performance while minimizing hardware development, 
one should only apply FPGA acceleration to the components that exhibit the most frequently 
encountered behaviors. In hierarchical simulation, one should start from a software-based, full- 
system simulator that already exists to cover the total set of behaviors. Next, based on profiling, 
one chooses what is necessary to accelerate in the FPGA to achieve the goal of acceleration. It 
is not unreasonable to assume a software-based full-system simulator to exist as a starting point 
because if such a simulator did not exist, an FPGA-accelerated simulation project should begin 
by creating one as a first step. Not only is this the most tractable way to capture and debug the 
wide range of behaviors necessary, but it is also a crucial enabler in validating the FPGA-captured 
behaviors later on. 


4.2.1 HIERARCHICAL SIMULATION 


Figure 4.1 illustrates, at the conceptual level, the difference between a software-only and a hier- 
archical approach to full-system multiprocessor simulation. In hierarchical simulation, software 
and FPGA hosts are used concurrently to support the simulation of different parts of the full 
system. In this still simplified view of Hierarchical Simulation, all components are either hosted 
in hardware on the FPGA or simulated by the reference software simulator; specifically, the main 
memory and processors are implemented in the FPGA while the remaining components are re- 
tained in the reference software simulator (e.g., disk storage and network interfaces, etc.). Both 
the hardware-hosted and software-simulated components are advanced concurrently to model 
the progress of the complete target system. On one hand, when a processor invokes an I/O device 
(e.g., using memory-mapped I/O or DMA), the processor model simulated on the FPGA is in 
fact interacting with the software-simulated device in the software simulator. On the other hand, 
when a software-simulated DMA-capable I/O device accesses memory, the device accesses the
DRAM memory modules on the FPGA host platform.


4.2.2 TRANSPLANTING 


A complex component such as a processor encompasses a small set of frequent behaviors (ADDs, 
LOADs, TLB/cache accesses, etc.) and a much more extensive set of complicated and fortunately 
often also rare behaviors (privileged instructions, MMU activities, etc.). Assigning the complete 
set of processor behaviors statically to either the software simulation or FPGA simulation host 
would result in either the simulation being too slow or the FPGA development being too
complicated. These conflicting goals can be reconciled by supporting transplantable components, which
can be re-assigned to the FPGA host or software simulation dynamically at runtime during hybrid
simulation.

Figure 4.1: Partitioning a simulated target system across FPGA and software simulation in the
ProtoFlex simulator. Panels: (a) components modeled by a full-system software simulator;
(b) components accelerated by FPGA; (c) components remaining with the software simulator.

Continuing with the processor example, the FPGA would implement only the most frequently
encountered subset of instructions. When this partially implemented processor
encounters an unimplemented behavior (e.g., a page table walk following a TLB miss), the
FPGA-hosted processor component is suspended and its processor state is transplanted (that is, 
copied) to its corresponding software-simulated processor model in the reference simulator. The 
software-simulated processor model, which supports the complete set of behaviors, is activated 
to carry out the unimplemented behavior. Afterward, the processor state is transplanted back to 
the FPGA-hosted processor model to resume accelerated execution of common case behaviors. 
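
The transplant sequence just described can be sketched as a control loop. This is a software stand-in written for illustration and is not ProtoFlex code: the supported-instruction set, the one-field processor state, and the function names are all assumptions.

```python
# Assumed common-case subset implemented by the FPGA-hosted processor model.
FPGA_SUPPORTED = {"add", "load", "store"}


def fpga_execute(state, instr):
    """Stand-in for the fast, partial FPGA-hosted processor model."""
    state["pc"] += 1
    return state


def software_execute(state, instr):
    """Stand-in for the reference simulator, which supports every behavior."""
    state["pc"] += 1
    return state


def run(program, state):
    """Execute `program`, transplanting on unimplemented behaviors.

    Returns (final state, number of transplants).
    """
    transplants = 0
    for instr in program:
        if instr in FPGA_SUPPORTED:
            state = fpga_execute(state, instr)
        else:
            # Suspend the FPGA-hosted model and copy its state out.
            sw_state = dict(state)
            sw_state = software_execute(sw_state, instr)
            # Copy the state back and resume accelerated execution.
            state = dict(sw_state)
            transplants += 1
    return state, transplants


final_state, transplants = run(["add", "load", "privileged_op", "add"], {"pc": 0})
```

Here three of the four instructions stay on the fast path, and only the unsupported one pays for the two state copies.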


4.2.3 HIERARCHICAL TRANSPLANTING 


A full transplant from the FPGA to the full-system simulation host can incur high cost, from
microseconds (PCIe latency) to milliseconds (Ethernet latency), depending on how the software
host and the FPGA host are interconnected. Such a high transplant cost would require even
relatively rare behaviors to be implemented in the FPGA (to drive down the frequency of required
transplants).

Consider a scenario with a 100 MHz FPGA host capable of simulating one instruction
per cycle (in other words, 100 MIPS) for the supported instructions. Further assume a 99.999%
dynamic instruction coverage, that is, only one transplant is required per 100,000 instructions
executed. If the transplant latency is one millisecond, the average execution time per 100,000
instructions becomes two milliseconds, halving the throughput to 50 MIPS. The effective average
instruction execution time can be expressed as


T effective = Tby-FPGA + Rmiss x Tpy-txplant 


$T_{\text{by-FPGA}}$ is the time required to execute one instruction on the FPGA host or the time to
determine that it is an unsupported instruction. $T_{\text{by-txplant}}$ is the time required to execute
one instruction by the software host, including the transplant latency. $R_{\text{miss}}$ is the fraction
of dynamic instructions that is not supported by the FPGA host. In the example scenario above,
$T_{\text{by-FPGA}}$ = 10 nsec, $T_{\text{by-txplant}}$ = 1 msec, and $R_{\text{miss}}$ = 0.00001.
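
As a quick numerical check of the example scenario, the expression can be evaluated directly. The function and parameter names below are illustrative choices, not part of any simulator API.

```python
def effective_time_ns(t_by_fpga_ns, r_miss, t_by_txplant_ns):
    """Average time per instruction when a fraction r_miss must be transplanted."""
    return t_by_fpga_ns + r_miss * t_by_txplant_ns


def throughput_mips(t_ns):
    """Convert an average per-instruction time in nanoseconds to MIPS."""
    return 1_000.0 / t_ns


# Values from the example: 10 nsec on the FPGA, 1 msec per transplant,
# one transplant per 100,000 instructions.
t_eff = effective_time_ns(10.0, 1e-5, 1_000_000.0)
print(f"{t_eff:.1f} ns per instruction, {throughput_mips(t_eff):.1f} MIPS")
```

With these inputs the transplant term contributes as much time as the FPGA term, which is why the throughput drops from 100 MIPS to about 50 MIPS.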

The equation above should be strongly reminiscent of the effective memory access time
through a cache. This parallel points to a simple yet effective solution. Just as computer
architects introduce more levels of cache hierarchy to bridge the gap between processor
and DRAM speed (as opposed to building bigger caches or faster DRAMs), one can similarly
introduce a hierarchy of intermediate software transplant hosts with staggered, increasing
instruction coverage and performance costs. For example, today's FPGAs can support embedded
processors, realized as either soft- or hard-logic cores, which can execute a software simulation
kernel that covers the entire set of processor behaviors. The simulation on the embedded
processor is still slow relative to the FPGA-hosted instructions but incurs much less cost than a
full transplant to the full-system software simulator. At the same time, when writing a software
simulation kernel, it is much easier to capture enough, if not all, of the processor behaviors to
achieve a dynamic instruction coverage high enough to reduce the number of times one needs to
pay the full cost of transplanting to the external software host. If all of the processor behaviors
are captured by the software simulation kernel running on the embedded processor core, the
reference software simulator is relegated to providing simulation support of the I/O subsystem only.

To complete the analogy with hierarchical caches, the effective average instruction execu- 
tion time of two levels can be expressed as 


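$$T_{\text{effective}} = T_{\text{by-FPGA}} + R_{\text{miss-FPGA}} \times T_{\text{by-}\mu\text{txplant-effective}}$$


$$T_{\text{by-}\mu\text{txplant-effective}} = T_{\text{by-}\mu\text{txplant}} + R_{\text{miss-}\mu\text{txplant(filtered)}} \times T_{\text{by-txplant}}$$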


In the above, μtransplant (microtransplant) refers to the intermediate transplant to an
embedded simulation kernel on the FPGA. Suppose the average execution time of an instruction
by the embedded kernel, $T_{\text{by-}\mu\text{txplant}}$, is 10 μsec, or 1,000 times slower than
$T_{\text{by-FPGA}}$. If we suppose the embedded software simulation kernel is complete enough to
miss only one in 1,000,000 instructions, the filtered miss rate at the embedded software
simulation kernel is 10% (one in 1,000,000 divided by one in 100,000), and thus
$T_{\text{by-}\mu\text{txplant-effective}}$ = 0.11 msec. The resulting overall $T_{\text{effective}}$ with
microtransplant is 11.1 nsec, or only 11% more than if everything were executed by the FPGA.
Keep in mind that reducing $R_{\text{miss}}$ from one in 100,000 to one in 1,000,000 would require a
disproportionate increase in the completeness of the processor modeling; one practically would
have to implement the entire processor at that point. This is much more easily done in an
embedded software-simulation kernel than by trying to capture the processor model completely in
the FPGA. Even if one were to undertake the herculean effort (in terms of both design time and
logic resources) of completing the processor modeling entirely in the FPGA, it would only result
in a 1% performance gain over hierarchical transplanting in the example scenario.
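
The two-level numbers in this example can be reproduced with a short calculation. The sketch below uses illustrative function and variable names and takes all inputs from the scenario in the text.

```python
def two_level_effective_ns(t_by_fpga, r_miss_fpga, t_by_utxplant,
                           r_miss_utxplant_filtered, t_by_txplant):
    """Effective per-instruction time (ns) with a microtransplant level in between."""
    t_utxplant_effective = t_by_utxplant + r_miss_utxplant_filtered * t_by_txplant
    return t_by_fpga + r_miss_fpga * t_utxplant_effective


T_FPGA_NS = 10.0            # FPGA-hosted instruction
T_UTXPLANT_NS = 10_000.0    # embedded kernel, 10 usec
T_TXPLANT_NS = 1_000_000.0  # full transplant, 1 msec

r_miss_fpga = 1 / 100_000        # misses at the FPGA level
r_miss_overall = 1 / 1_000_000   # misses that also escape the embedded kernel
r_filtered = r_miss_overall / r_miss_fpga   # about 10% of FPGA misses go on

hierarchical = two_level_effective_ns(
    T_FPGA_NS, r_miss_fpga, T_UTXPLANT_NS, r_filtered, T_TXPLANT_NS)

# Alternative: extend the FPGA model itself until it misses one in 1,000,000.
fpga_only = T_FPGA_NS + r_miss_overall * T_TXPLANT_NS

print(f"hierarchical: {hierarchical:.1f} ns, FPGA-only: {fpga_only:.1f} ns")
```

The two results differ by roughly 0.1 nsec, which is the source of the 1% figure quoted above.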


4.3 VIRTUALIZED SIMULATION OF MULTIPROCESSORS


As stated earlier, a simple but impractical approach to constructing an N-way multiprocessor 
simulator in an FPGA is to replicate N cores and integrate them together with a large-scale in- 
terconnection substrate. Although this meets the requirement of simulating a large-scale system, 
the development effort and required logic resources would be prohibitive when N is more than just 
a handful. The advantage of this large hardware implementation effort of course is the aggregate 
simulation throughput of N cores. The question is: if one were willing to tolerate less performance, 
while still achieving orders-of-magnitude gain over conventional software-based simulation, can 
one trade the excess performance for a significantly reduced hardware implementation effort? 


4.3.1 TIME-MULTIPLEXED VIRTUALIZATION 


Time-multiplexed virtualization offers a performance-driven approach that trades the excess
simulation performance for a more tractable hardware development effort and cost. In
time-multiplexed virtualization, as the name implies, a single resource is used to simulate
multiple virtual copies in a time-multiplexed fashion. This technique is especially useful in
supporting the simulation of a multiprocessor target where the multiple processor contexts can be
readily mapped onto a single, fast multiple-context engine. This virtualization decouples the
scale of the simulated target system from that of the FPGA host platform and the hardware
development effort. The scale of the FPGA simulation platform is only a function of the desired
simulation throughput (i.e., throughput is raised by scaling up the number of engines). For
example, Figure 4.2 illustrates conceptually a large-scale multiprocessor simulator where
multiple simulated processors in a large-scale target system are mapped to share a small number
of engines.

Figure 4.2: Large-scale multiprocessor simulation using a small number of multiple-context
interleaved engines.

In Figure 4.2, the multiple-context pipeline is augmented with instruction-interleaved
multithreading support, in which an instruction is issued from a rotating set of processor
contexts on each cycle. An interleaved pipeline enjoys the same implementation advantages as
multithreaded pipelines found in the CDC Cyber [47] and HEP [42]. With enough available
processor contexts to keep the engine occupied, it is possible to design a deep pipeline without
the ill effects of data-dependent stalls. Moreover, a context blocked by long-latency events such
as accesses to main memory or a transplant operation can be taken out of the scheduler to allow
other contexts to do useful work.
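
The issue policy described above (one instruction per cycle from a rotating set of contexts, with blocked contexts removed from the rotation) can be sketched as a small scheduler model. This is a behavioral illustration only; the function name and the `blocked_until` argument are assumptions made for the example.

```python
from collections import deque


def interleave(contexts, cycles, blocked_until=None):
    """Return the context issued on each cycle (None marks a pipeline bubble).

    blocked_until maps a context id to the cycle at which its long-latency
    event (e.g., a memory access or transplant) completes.
    """
    blocked = dict(blocked_until or {})
    ready = deque(c for c in contexts if c not in blocked)
    issued = []
    for cycle in range(cycles):
        # Return contexts whose long-latency event has completed to the rotation.
        for ctx in [c for c, t in blocked.items() if t <= cycle]:
            del blocked[ctx]
            ready.append(ctx)
        if not ready:
            issued.append(None)
            continue
        ctx = ready.popleft()
        issued.append(ctx)
        ready.append(ctx)  # rotate: this context goes to the back of the queue
    return issued
```

For example, `interleave([0, 1, 2], 6, blocked_until={1: 3})` keeps the engine busy with contexts 0 and 2 while context 1 waits, and context 1 rejoins the rotation once cycle 3 is reached.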

In the structurally accurate prototyping approach, besides the logic resources needed to
instantiate N copies of the processor cores, substantial resources must also be devoted to a
high-performance, high-endpoint interconnection. For a shared-memory multiprocessor, this may
even entail deploying a high-performance cache-coherent shared-memory hierarchy so that the
cores can execute concurrently (even though the goal is only architectural simulation). When many
target processor contexts are multiplexed onto a single simulation engine, this complexity is
automatically reduced. The multiple target processor contexts sharing a pipeline automatically
see coherent shared memory through the common cache.

In theory, one could achieve greater scale in the target system by increasing the degree of 
interleaving. However, the added number of contexts beyond a certain limit would prohibitively 
degrade per-CPU simulation throughput. A second dimension of scaling is to grow the number of 
interleaved engines (and hence FPGAs). Scaling across both dimensions (virtual interleaving and
physical replication) introduces new complexities. At a basic level, new infrastructure is needed for 
communication and memory sharing between the multiple distributed simulation engines. The 
consolidation provided by time-multiplexing still helps by reducing the number of simulated host 
pipelines connected through distributed shared memory, a number that should be much smaller 
than the number of simulated target nodes. 


4.3.2 VIRTUALIZING MEMORY CAPACITY 


When simulating a large-scale multiprocessor system, a large number of target processors could be 
mapped onto affordable FPGA logic resources via time-multiplexing, provided the commensurate 
slow down is acceptable. The same time-multiplexing approach, unfortunately, is not applicable 
when virtualizing the required amount of DRAM capacity. 

While actual total capacity cannot be shortchanged, it is possible to use hierarchical memory
techniques to achieve the appearance of a larger DRAM using a slower, higher-capacity backing
store. Specifically, memory nodes in the host system could be used as a cache of a larger
backing disk storage that contains the full contents of the target system's main memory. This is
not unlike software simulators modeling memory systems of much larger capacity than available
on the workstation host by means of standard OS demand paging from a slower backing disk
storage. Keep in mind that the DRAM and disks of the target run at native speed while the
simulated target processors run at an order of magnitude slowdown. This helps absorb the effects
of memory virtualization even when the simulation experiences poor locality of reference. Akin to
processor virtualization, memory virtualization also enables us to trade off the resources
expended in the host memory against the desired level of simulation performance.
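
The idea of using host memory as a cache over a larger backing store can be illustrated with a small demand-paging model. This is a sketch under simplifying assumptions: the class name, the LRU replacement policy, the 4 KB page size, and the in-memory dictionary standing in for disk are all choices made for the example, not details given in the text.

```python
from collections import OrderedDict


class VirtualizedMemory:
    """Target memory backed by a large store, with a few pages resident in host DRAM."""

    def __init__(self, host_pages, page_size=4096):
        self.host_pages = host_pages
        self.page_size = page_size
        self.resident = OrderedDict()  # page number -> bytearray, kept in LRU order
        self.backing = {}              # stand-in for disk: page number -> bytes
        self.page_faults = 0

    def _page(self, addr):
        page_no = addr // self.page_size
        if page_no in self.resident:
            self.resident.move_to_end(page_no)  # mark as most recently used
        else:
            self.page_faults += 1
            if len(self.resident) >= self.host_pages:
                # Evict the least recently used page and write it back.
                victim, data = self.resident.popitem(last=False)
                self.backing[victim] = bytes(data)
            stored = self.backing.get(page_no, bytes(self.page_size))
            self.resident[page_no] = bytearray(stored)
        return self.resident[page_no]

    def read(self, addr):
        return self._page(addr)[addr % self.page_size]

    def write(self, addr, value):
        self._page(addr)[addr % self.page_size] = value
```

Shrinking `host_pages` raises `page_faults` for the same access stream, which is the resource-versus-performance trade-off the paragraph above describes.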


4.4 CASE STUDY: THE PROTOFLEX SIMULATOR


The ideas presented in this chapter have been realized in the ProtoFlex full-system architectural 
simulator modeled after the SunFire 3800 server [14]. The ProtoFlex simulator uses Simics [26] 
running on a standard PC workstation as the reference simulator and incorporates a single Xilinx 
XC2VP70 FPGA for acceleration. This simulator faithfully models a 16-way symmetrical mul- 
tiprocessing (SMP) UltraSPARC III server to such a degree that it is capable of booting Solaris 
8 and running commercial workloads such as Oracle On-Line Transaction Processing (OLTP).
At the time of this work, the FPGA acceleration resulted in a 49x speedup over the reference
Simics simulator, a contemporary state-of-the-art software-only simulator. By decoupling the
complexity of the target system from what is required to be implemented in FPGA, the complete
FPGA-accelerated simulation system was developed by one graduate student in just a little over
one year.


4.4.1 PROTOFLEX DESIGN OVERVIEW


The design of the ProtoFlex FPGA-accelerated architectural simulator has three objectives. The 
first objective is to simulate large-scale multiprocessor systems with an acceptable slowdown 
(<100x). The second objective is to model full-system fidelity for executing realistic workloads,
including unmodified operating systems. The third objective is to lower the development effort and cost
to a justifiable level in an academic computer architecture research setting. Very explicitly, it was 
never a goal to capture the accurate structure or sub-instruction granularity timing of the target 
multiprocessor system. From these goals, it follows quite naturally to use FPGA as a virtualizable 
resource for simulation execution and not for implementing the simulated target. 

Figure 4.3 shows a high-level block diagram of how the functionality of the target 16- 
way SMP server is mapped onto software simulation versus the FPGA hosts. The main mem- 
ory system is hosted directly by DRAM modules on the FPGA host. All 16 target processors 
are mapped onto a single multi-context BlueSPARC simulation engine contained on one Xilinx 
Virtex-II XC2VP70 FPGA [13]. The interleaved BlueSPARC pipeline is capable of transplant- 
ing any one of the 16 processor contexts to the software simulator (while the remaining contexts 
continue unimpeded) on encountering an unimplemented UltraSPARC III behavior. In addition 
to the interleaved pipeline, the nearby PPC405 processor embedded in the FPGA serves as the 
microtransplant host (Section 4.2.3). The reference Simics simulator running on a PC worksta- 
tion, connected to the FPGA host by Ethernet, provides the third hosting option for the target 
system's I/O subsystem. The ProtoFlex simulator leverages the built-in API of Simics to issue
I/O accesses to simulated devices such as disks. 


4.4.2 BLUESPARC PIPELINE 


The FPGA-acceleration portion of the hierarchical ProtoFlex simulator is hosted on a Berkeley
Emulation Engine 2 (BEE2) FPGA platform [6]. One Xilinx Virtex-II XC2VP70 FPGA is
used to implement the BlueSPARC 16-context simulation pipeline (Figure 4.4) [13]. The
BlueSPARC engine is a 14-stage, instruction-interleaved pipeline that supports the multithreaded
execution of up to 16 UltraSPARC III processor contexts (Section 4.3.1). Table 4.1 summarizes
the most salient characteristics of the BlueSPARC pipeline. The maximum retirement rate of the
BlueSPARC pipeline is nominally one instruction per cycle, which in combination with its clock
frequency dictates the ProtoFlex simulator's peak simulation throughput.
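
The relationship between retirement rate, clock frequency, and peak throughput is simple arithmetic. The sketch below uses the 90 MHz clock and 16 contexts listed in Table 4.1; the even split across contexts is an idealization that ignores stalls and transplants.

```python
def peak_mips(clock_mhz, instructions_per_cycle=1.0):
    """Peak aggregate throughput of the engine in MIPS."""
    return clock_mhz * instructions_per_cycle


def per_context_mips(clock_mhz, contexts, instructions_per_cycle=1.0):
    """Idealized share of the peak throughput seen by each interleaved context."""
    return peak_mips(clock_mhz, instructions_per_cycle) / contexts


print(peak_mips(90), per_context_mips(90, 16))
```

Under these assumptions the engine peaks at 90 MIPS in aggregate, or about 5.6 MIPS per simulated processor when all 16 contexts are active.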

Table 4.1: BlueSPARC pipeline characteristics

16 64-bit UltraSPARC III contexts
14-stage interleaved pipeline

64KB I-cache, 64KB D-cache, 64B, direct-mapped
Writeback, non-blocking load/store, allocate-on-write
16 outstanding misses, 4-entry store buffer

90MHz
4GB total memory

33,508 LUTs (50%), 222 BRAMs (67%)

42,206 LUTs (65%), 238 BRAMs (72%)

Bluespec System Verilog
Xilinx EDK 9.2i, XST 9.2i

25K lines Bluespec, 511 rules, 89 module types


Figure 4.3: Allocating components for hierarchical simulation in the BlueSPARC simulator.


The design of the BlueSPARC engine is optimized first and foremost to: (1) ensure correctness,
(2) maximize maintainability for future design exploration, and (3) minimize effort. In many
cases, the design of the BlueSPARC engine allowed the designer to forgo complex performance
optimizations in favor of a simpler, more maintainable design. Recall that hierarchical simulation
allows the designer to omit rare behaviors from the FPGA implementation. Table 4.2 summarizes
how the various UltraSPARC III behaviors are assigned to the three hosting options: FPGA,
embedded PowerPC microtransplant, and PC full-transplant. These assignment decisions were made
based on rigorous dynamic instruction profiling of various applications simulated in Simics. With
hierarchical transplanting, only 99.95% of the dynamic instructions are executed in hardware on
the FPGA while the remainder is carried out in the microtransplant kernel (running on the
embedded PowerPC) and the software full-system PC host.


Table 4.2: Assignment of target behavior to simulation host (FPGA, microtransplant, full-transplant)

FPGA
* add/sub/shift/logical/multiply/divide instructions
* register windows and associated traps
* 38 (of 103) SPARC ASI instructions
* interprocessor interrupt cross-calls
* device and software interrupts
* I- and D-MMU (including software TLB)
* loads, stores, atomics (plus inverse-endian modes)
* VIS block memory instructions

Microtransplant
* 65 (of 103) SPARC ASI instructions
* VIS I/II multimedia instructions
* floating point add/sub/mul/div and associated traps
* floating point to/from integer conversion
* alignment instructions
* fixed-point arithmetic instructions
* TLB/cache diagnostic instructions
* TLB de-mapping operations

Full-transplant
* PCI bus, ISP2200 Fibre Channel, i21152 PCI Bridge
* IRQ Bus, Text console, SBBC PCI device
* Serengeti IO PROM, Cheerio-hme NIC
* Fibre Channel SCSI Disk, SCSI cdrom, SCSI bus
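
The three-way assignment of Table 4.2 amounts to a lookup from target behavior to hosting option. The sketch below encodes a handful of Table 4.2 entries under made-up category keys; the key names and the fallback to the most complete host for unlisted behaviors are assumptions for illustration, not the ProtoFlex dispatch logic.

```python
from enum import Enum


class Host(Enum):
    FPGA = "BlueSPARC pipeline on the FPGA"
    MICROTRANSPLANT = "simulation kernel on the embedded PowerPC"
    FULL_TRANSPLANT = "Simics on the PC workstation"


# A small subset of Table 4.2, keyed by hypothetical category names.
ASSIGNMENT = {
    "integer_alu": Host.FPGA,
    "load_store": Host.FPGA,
    "register_window_trap": Host.FPGA,
    "fp_arithmetic": Host.MICROTRANSPLANT,
    "vis_multimedia": Host.MICROTRANSPLANT,
    "tlb_demap": Host.MICROTRANSPLANT,
    "pci_bus": Host.FULL_TRANSPLANT,
    "scsi_disk": Host.FULL_TRANSPLANT,
}


def host_for(behavior):
    """Look up the hosting option; unlisted behaviors fall back to the full transplant."""
    return ASSIGNMENT.get(behavior, Host.FULL_TRANSPLANT)


def host_mix(trace):
    """Fraction of a behavior trace handled by each hosting option."""
    counts = {host: 0 for host in Host}
    for behavior in trace:
        counts[host_for(behavior)] += 1
    total = len(trace) or 1  # avoid dividing by zero on an empty trace
    return {host: counts[host] / total for host in Host}
```

Running `host_mix` over a profiled trace gives the kind of per-host coverage figure that drives assignment decisions like those in Table 4.2.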


4.4.3 PERFORMANCE EVALUATION


This section presents the performance evaluation of the ProtoFlex simulator using software
workloads comprising five SPECINT 2000 benchmarks (crafty, gcc, vortex, parser, bzip2) and an
On-Line Transaction Processing (OLTP) benchmark. For the SPECINT workloads, 16 copies of the
program are executed concurrently; each experiment measures simulation throughput for 100
billion aggregate instructions (after the initialization phase). For OLTP, the simulated server
runs the Oracle 10g Enterprise Database Server configured with 100 warehouses (10GB), 16
clients, and 1.4 GB SGA. Each experiment measures throughput for 100 billion aggregate
instructions in a steady-state execution (where database transactions are committing steadily).
As shown next, the workloads' characteristics have a large effect on the throughput of both the
Simics and the ProtoFlex simulators.


Table 4.3: Performance comparison (simulation throughput in MIPS)

Workload         ProtoFlex     Simics-fast   Simics-trace
Oracle-TPCC-16   50.5          1255          [illegible]
Bzip2-16         [illegible]   42.2          [illegible]
Crafty-16        45.1          [illegible]   [illegible]
GCC-16           62.4          32.4          118
Gzip-16          47.2          37.6          [illegible]
Parser-16        44.4          34.1          [illegible]
Vortex-16        26.1          35.6          [illegible]
Average          44.8          36.5          172

When Simics is invoked with the default "fast" option, it achieves tens of MIPS in simu- 
lation throughput. However, there is roughly a factor of 10x reduction in simulation throughput 
when Simics is enabled with trace callbacks for instrumentation [31], such as memory address 
tracing. The two columns in Table 4.3 labeled Simics-fast and Simics-trace report Simics through- 
put for the simulated 16-way SMP server. Simics simulations were run on a Linux PC workstation 
with a 2.0 GHz Core 2 Duo and 8 GBytes of memory. The performance most relevant to
architecture research activities is represented by the Simics-trace column. The simulation
throughput of the ProtoFlex simulator is reported in the left-most column of Table 4.3. For these
measurements, the BlueSPARC engine is clocked at 90MHz. The ProtoFlex simulator achieves 
speed comparable to Simics-fast on the SPECINT and Oracle-TPCC workloads. In compari- 
son to the more relevant Simics-trace performance, the speedup is more dramatic, on average 38x 
faster. 


4.4.4 HIERARCHICAL SIMULATION AND VIRTUALIZATION IN A
PERFORMANCE SIMULATOR


Though we discuss hierarchical simulation and virtualization in the context of ProtoFlex, which 
is a functional-only simulator, the techniques apply to performance simulators as well. 

To address the issue of functional model complexity on an FPGA, FPGA-based timing-directed
simulators that intend to be fairly complete can be combined with the transplant capabilities
pioneered by ProtoFlex. For example, HAsim [33] adopted both transplanting and multithreading
to provide full-system functionality and target scalability. Whenever an instruction that is not
implemented in the timing-directed simulator is encountered, HAsim flushes the pipeline and
transplants the processor state to a software functional model that executes that instruction and
returns the updated state to the timing-directed simulator. However, because the instruction is
executed in a software functional model and not within the timing model, inaccuracy is
introduced whenever there is a transplant.

In addition, the software functional model must have access to the state of the simulation,
including memory values, requiring the FPGA and the CPU to share memory to some degree.
That can be achieved by having true shared memory in the underlying host, or by moving data
values explicitly between the CPU and FPGA as needed. There are FPGA platforms that already
allow the CPU and FPGA to share memory, such as the Intel/Nallatech ACP system that places
an FPGA into an Intel front-side bus socket, allowing its requests to be snooped by the FPGA,
the DRC/XtremeData systems that place an FPGA into a HyperTransport socket, the Xilinx
Zynq platform that enables the FPGA to access the embedded ARM cores' cache, the Intel
QuickPath Interconnect (QPI) to which an FPGA can attach, and the IBM Power8 CAPI interface
that provides a coherent interface between the FPGA and the Power8 processor.
CHAPTER 5


Categorizing FPGA-based
Simulators


To summarize, there are three high-level orthogonal characteristics of FPGA-accelerated
simulators: (1) the simulator architecture, (2) the partitioning between FPGA and software host,
and (3) the virtualization support within the simulator that makes better use of host resources
when simulating targets. To review, the five simulator architectures described in Chapter 3 are
monolithic, timing-directed, functional-first, timing-first, and speculative functional-first.
Partitioning between the FPGA and software host refers to partitioning between a software
functional model and an FPGA timing model, or between an FPGA-based common-case functional
model and a complete software functional model. Virtualization refers to techniques such as
multithreading that enable multiple target components to share the same host resources.

Designing an FPGA-based simulator requires selecting a number of points in the design
space, ranging from the simulator architecture to the particular optimization strategies used to
cope with the restrictions of hardware-based accelerated simulation. Table 5.1 summarizes a few
of the common FPGA-based simulator artifacts along with the selected choices in terms of
simulator architecture, partition mapping, synchronization schemes, and optimization techniques.


5.1 FAME CLASSIFICATIONS 


The FAME [46] classification also defines three characteristics: direct versus decoupled, RTL 
versus abstract machine, and single-threaded versus multithreaded. Direct versus decoupled in- 
dicates whether one host cycle is used to simulate a single target cycle (direct) or multiple host 
cycles are used to simulate a single target cycle (decoupled). RTL versus abstract machine is what
we call prototype versus simulator. (We do not consider the prototype version to be a true sim- 
ulator.) Single-threaded versus multi-threaded refers to the use of multi-threading techniques at 
the simulator level to tolerate host latencies such as access to host DRAM. Multi-threaded is tied 
to decoupled, since one cannot use a single host cycle per target cycle if one is multi-threading 
the simulator. 
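
The three FAME characteristics, together with the constraint that multi-threading implies decoupling, can be captured in a small record type. The class and field names below are illustrative and are not taken from the FAME paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FameClass:
    decoupled: bool       # False means direct: one host cycle per target cycle
    abstract: bool        # False means RTL
    multithreaded: bool   # False means single-threaded

    def __post_init__(self):
        # Multi-threading the simulator rules out one host cycle per target cycle.
        if self.multithreaded and not self.decoupled:
            raise ValueError("a multi-threaded simulator must be decoupled")

    def describe(self):
        return ", ".join([
            "decoupled" if self.decoupled else "direct",
            "abstract machine" if self.abstract else "RTL",
            "multi-threaded" if self.multithreaded else "single-threaded",
        ])
```

Constructing `FameClass(decoupled=False, abstract=False, multithreaded=True)` raises `ValueError`, reflecting the dependency noted above.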


5.2 OPEN-SOURCED FPGA-BASED SIMULATORS


To close this chapter, this section briefly describes several open-sourced FPGA-based
simulators that employ the simulation techniques covered in this manuscript.


Table 5.1: Summary of various FPGA simulator artifacts in terms of basic simulator architecture and
key characteristics of partition mapping, synchronization, and FPGA optimization techniques

Name       Simulator Architecture        Partition Mapping    Synchronization Scheme            Optimization Techniques
ProtoFlex  Functional-Only               FM(fpga), TM(sw)     Centralized, functional contexts  Multi-threading, Hierarchical Simulation
FAST-UP    Speculative Functional-First  FM(sw), TM(fpga)     Decentralized, port level         Functional Speculation
HAsim      Timing-Directed               FM(fpga), TM(fpga)   Decentralized, port level         Multi-threading
RAMP-Gold  Timing-Directed               FM(fpga), TM(fpga)   Decentralized, model [illegible]  Multi-threading
DART       Monolithic                    NOC(fpga)            Centralized

5.2.1 PROTOFLEX 


As discussed in the last chapter, the ProtoFlex simulator was developed at Carnegie Mellon Uni- 
versity to support FPGA-accelerated functional simulation of full-system, large-scale multipro- 
cessor systems [14]. The ProtoFlex functional model targets a 64-bit UltraSPARC III ISA (com- 
pliant with the commercially available software-based full-system simulator model from Sim- 
ics [26]) and is capable of booting commercial operating systems such as Solaris 10 and running 
commercial workloads (with no available source code) such as Oracle TPC-C. ProtoFlex was 
the first system to introduce hierarchical simulation and host multithreading as techniques for 
reducing the complexity of simulator development and to virtualize finite hardware resources. 
The ProtoFlex simulator is available at [36] and targets the XUPV5-LX110T platform, a widely
available and low-cost commodity FPGA platform. 


5.2.2 HASIM 


The HAsim project was developed at MIT and Intel and employs host multithreading,
hierarchical simulation, and timing-directed simulation with functional-timing partitioning.
HAsim currently supports the Alpha ISA and has been used to target a Nallatech ACP accelerator
with a Xilinx Virtex 5 LX330T FPGA connected through Intel's Front-Side Bus protocol. HAsim has
been used to simulate a detailed 4x4 multicore with 64-bit Alpha out-of-order processors on a
single FPGA. HAsim is available for download at [20].
5.2.3 RAMP GOLD


RAMP Gold is a simulator of a 64-core 32-bit SPARC V8 target developed at UC Berkeley.
The first implementation was done on a Xilinx XUPV5 board. RAMP Gold employs host
multithreading and a functional-first simulator and is capable of booting Linux. The RAMP Gold
package is available at [37] and includes CPU, cache, and DRAM timing models. RAMP Gold
simulators were aggregated together onto 24 Xilinx FPGAs in the Diablo project, which has been
used to reproduce effects-at-scale such as TCP Incast [45].
CHAPTER 6


Conclusion 


This book describes techniques for practical and efficient simulation of computer systems using 
FPGAs. There is a distinction between using FPGAs as a vehicle for simulation and the use 
of FPGAs for prototyping. FPGA-accelerated simulation implies that a significant portion of 
the simulation is implemented in software, and that at least part of the simulator is structurally 
different than the target. 

This manuscript surveys simulator architectures and describes how different simulator
architectures have been accelerated with FPGAs. One simulator architecture in particular,
speculative functional-first (SFF), was designed from the ground up to enable FPGA acceleration
of performance simulators. Though SFF is not limited to software-based functional models and
FPGA-based timing models, SFF provides many advantages including completeness, reduced FPGA
resources, and the ability to tolerate latency. FAST-UP was the first implementation of an SFF
simulator; it simulated a dual-issue, branch-predicted, out-of-order x86-based computer in
sufficient detail to boot both Linux and Windows and run interactive Microsoft Word and
YouTube on Internet Explorer. The FPGA-based timing model was the bottleneck. FAST-MP
leverages both SFF and FPGA multithreading to build a 256-core target supporting a
branch-predicted seven-stage pipeline that is also intended to boot Linux and Windows while
running arbitrary off-the-shelf software.

This book also describes hierarchical simulation that implements commonly used func- 
tionality on the FPGA, and less commonly used functionality in software. In addition, FPGA 
virtualization enables the mapping of multiple virtual components, such as a CPU, onto a sin- 
gle physical execution engine. ProtoFlex’s BlueSPARC leverages both techniques to provide an 
FPGA-accelerated, full system functional model that is capable of functionally simulating sixteen 
64-bit UltraSPARC V9 cores at 90 MHz on a single FPGA coupled to a microprocessor.

Because SFF is intended for performance simulation and because its functional model and 
timing model can both be parallelized, it occupies a different point in the simulator space than hi- 
erarchical simulation described in this book. Rather than accelerating functionality on the FPGA, 
it places all of the functionality in software, where it can run very quickly due to a fast baseline 
simulator, and most of the timing is carried out in the FPGA. ProtoFlex/BlueSPARC accelerate 
common functional instructions in the FPGA, and transplanting to software is only necessary to 
provide full functionality. In both cases, however, the optimal host platform/simulator is used to 
maximize performance. 
In conclusion, FPGA-accelerated simulators are highly performant while providing many
of the desirable simulator benefits including accuracy, completeness, and usability. FPGA- 
accelerated simulators' main challenge is ease of programmability. The overall promise of FPGA- 
accelerated simulators, however, is a compelling reason to continue researching the area. 
APPENDIX A


Field Programmable Gate 
Arrays 


Field Programmable Gate Arrays (FPGAs) are VLSI devices that contain large numbers of
programmable logic elements, registers, memories, and configurable routing that connects the
outputs of components to the inputs of other components as specified by the programmer. These
powerful devices can be used for a wide range of applications, including applications that have
traditionally been thought of as being possible to implement efficiently only on a general-purpose
microprocessor. This appendix briefly describes the internal structure of an FPGA. Note that
FPGA architecture is fairly specific to a particular manufacturer.


A.1 PROGRAMMABLE LOGIC ELEMENTS 


Programmable logic elements are implemented in small memories called look-up tables (LUTs).
Any arbitrary two-input gate can be implemented in a four-bit memory with two inputs, which
specify the address, and one output. Current FPGAs from Xilinx and Altera, the two largest
FPGA manufacturers, use larger-input LUTs that are much more powerful than a two-input
LUT. For example, a four-input mux can be implemented in a single 6-input LUT.
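
The view of a LUT as a small memory addressed by its inputs can be made concrete with a truth-table model. The helper names below are illustrative; the input ordering (input i is bit i of the address) is an arbitrary choice made for this sketch.

```python
def make_lut(func, n_inputs):
    """Return the LUT contents (one bit per address) for an n-input Boolean function."""
    table = []
    for addr in range(2 ** n_inputs):
        bits = [(addr >> i) & 1 for i in range(n_inputs)]  # input i is address bit i
        table.append(func(*bits) & 1)
    return table


def lut_read(table, *inputs):
    """Evaluate the LUT: the inputs form the address, the stored bit is the output."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return table[addr]


# A two-input XOR gate fits in a four-bit memory.
xor_lut = make_lut(lambda a, b: a ^ b, 2)


def mux4(d0, d1, d2, d3, s0, s1):
    """Four-input multiplexer: four data inputs plus two select inputs."""
    return (d0, d1, d2, d3)[s0 | (s1 << 1)]


# Four data bits and two select bits make six inputs, hence a 64-entry table.
mux_lut = make_lut(mux4, 6)
```

Here `xor_lut` holds four bits and `mux_lut` holds 64, matching the two-input and six-input cases mentioned above.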

LUTs are bundled with other functionality such as adders, registers, and the ability to select 
from different clocks to form a larger block known as an Adaptive Logic Module (ALM) (Altera) 
or a slice (Xilinx). Figure A.1 and Figure A.2 are high-level and detailed drawings of an Altera 
ALM. Note that the interconnect is quite rich, enabling a huge number of different possible 
configurations of a single ALM (Figure A.3). The underlying architectures and how designs are
mapped to them are much of the secret sauce of FPGA vendors.

LUTs can also be used as memory. LUT memory, called Memory Logic Array Blocks by Altera
and distributed memory by Xilinx, can be specified in a variety of depths, widths, and numbers
of ports. Structures around the LUTs, including the ALM/slice infrastructure, make the LUT
memory more efficient.


A.2 EMBEDDED SRAM BLOCKS 


Figure A.1: A high-level depiction of an Altera Stratix V ALM. There are a pair of six-input LUTs
(though they share inputs), a pair of adders, and four registers. Figure used with permission from
Altera.

Figure A.2: A detailed depiction of an Altera Stratix V ALM. Note that each six-input LUT is
implemented as a four-input LUT, a pair of three-input LUTs, and muxes. Figure used with permission
from Altera.

Figure A.3: The different possible configurations of an Altera ALM. Figure used with permission
from Altera.

FPGAs also contain block memories (BRAMs) that are generally two-ported SRAMs. BRAMs
from Altera are currently 20K bits large, while BRAMs from Xilinx are 36K bits large. BRAMs
can generally be configured width-wise up to the native size of the BRAM. For example, the
Altera BRAM can be configured in the following configurations, all dual-ported: 512x32b, 512x40b,
1Kx16b, 1Kx20b, 2Kx8b, 2Kx10b, 4Kx4b, 4Kx5b, 8Kx2b, and 16Kx1b.
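
As a sanity check, each listed depth-by-width configuration can be compared against the 20K-bit block size. The sketch assumes that "K" means 1,024 here, so that 20K bits is 20,480 bits; the variable names are illustrative.

```python
BRAM_BITS = 20 * 1024  # assumption: 20K bits = 20,480 bits

# (depth, width in bits) for the Altera configurations listed above.
CONFIGS = [
    (512, 32), (512, 40),
    (1024, 16), (1024, 20),
    (2048, 8), (2048, 10),
    (4096, 4), (4096, 5),
    (8192, 2), (16384, 1),
]

for depth, width in CONFIGS:
    used = depth * width
    print(f"{depth:>5} x {width:>2}b uses {used} bits ({used / BRAM_BITS:.0%} of the block)")

all_fit = all(depth * width <= BRAM_BITS for depth, width in CONFIGS)
```

Under this assumption every configuration fits, and the 40-, 20-, 10-, and 5-bit-wide options use the full block while the power-of-two widths use 16,384 bits.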


A.3 HARD “MACROS” 


Modern FPGAs also contain Digital Signal Processing blocks that provide wide arithmetic func- 
tions such as adders and multipliers as well as more specialized blocks such as FIR filters. Current 
high-end FPGAs have thousands of such DSP blocks. 

Many modern FPGAs also contain embedded ARM cores. Currently shipping parts have 
dual ARM A9 MP cores, while FPGAs shipping within a year will have multiple 64-bit ARM 
cores running at 1.5GHz and up. FPGAs have become capable System-on-a-Chip (SoC) de- 
vices in their own right, where the ARM cores can be tightly integrated to custom hardware 
implemented within the FPGA reconfigurable fabric. 

Modern FPGAs contain extensive routing resources that can be thought of as a statically 
configured network (routing fabric) consisting of a huge number of statically configured switches 
in a mesh-like network, with local and more global routing resources. The connection points 
between the inputs and outputs and the routing fabric are also configurable. 